Abstract

This paper investigates stochastic and adversarial combinatorial multi-armed bandit
problems. In the stochastic setting under semi-bandit feedback, we derive
a problem-specific regret lower bound, and discuss its scaling with the dimension
of the decision space. We propose ESCB, an algorithm that efficiently exploits
the structure of the problem, and provide a finite-time analysis of its regret.
ESCB has better performance guarantees than existing algorithms, and significantly
outperforms these algorithms in practice. In the adversarial setting under
bandit feedback, we propose CombEXP, an algorithm with the same regret scaling
as state-of-the-art algorithms, but with lower computational complexity for
some combinatorial problems.


1 Introduction 

Multi-Armed Bandit (MAB) problems [1] constitute the most fundamental sequential decision
problems with an exploration vs. exploitation trade-off. In such problems, the decision maker selects
an arm in each round, and observes a realization of the corresponding unknown reward distribution.
Each decision is based on past decisions and observed rewards. The objective is to maximize the
expected cumulative reward over some time horizon by balancing exploitation (arms with higher
observed rewards should be selected often) and exploration (all arms should be explored to learn
their average rewards). Equivalently, the performance of a decision rule or algorithm can be measured
through its expected regret, defined as the gap between the expected reward achieved by the
algorithm and that achieved by an oracle algorithm always selecting the best arm. MAB problems
have found applications in many fields, including sequential clinical trials, communication systems,
and economics; see e.g. [2, 3].

In this paper, we investigate generic combinatorial MAB problems with linear rewards, as introduced
in [4]. In each round $n \ge 1$, a decision maker selects an arm $M$ from a finite set
$\mathcal{M} \subset \{0,1\}^d$ and receives a reward $M^\top X(n) = \sum_{i=1}^d M_i X_i(n)$.
The reward vector $X(n) \in \mathbb{R}_+^d$ is unknown.
We focus here on the case where all arms consist of the same number $m$ of basic actions in the
sense that $\|M\|_1 = m$, $\forall M \in \mathcal{M}$. After selecting an arm $M$ in round $n$, the decision maker
receives some feedback. We consider both (i) semi-bandit feedback under which after round $n$, for
all $i \in \{1,\dots,d\}$, the component $X_i(n)$ of the reward vector is revealed if and only if $M_i = 1$; (ii)
bandit feedback under which only the reward $M^\top X(n)$ is revealed. Based on the feedback received
up to round $n-1$, the decision maker selects an arm for the next round $n$, and her objective is to
maximize her cumulative reward over a given time horizon consisting of $T$ rounds. The challenge in
these problems resides in the very large number of arms, i.e., in its combinatorial structure: the size
of $\mathcal{M}$ could well grow as $d^m$. Fortunately, one may hope to exploit the problem structure to speed
up the exploration of sub-optimal arms.

We consider two instances of combinatorial bandit problems, depending on how the sequence
of reward vectors is generated. We first analyze the case of stochastic rewards, where for all
$i \in \{1,\dots,d\}$, $(X_i(n))_{n\ge 1}$ are i.i.d. with Bernoulli distribution of unknown mean. The reward
sequences are also independent across $i$. We then address the problem in the adversarial setting
where the sequence of vectors $X(n)$ is arbitrary and selected by an adversary at the beginning of
the experiment. In the stochastic setting, we provide sequential arm selection algorithms whose
performance exceeds that of existing algorithms, whereas in the adversarial setting, we devise simple
algorithms whose regret has the same scaling as that of state-of-the-art algorithms, but with lower
computational complexity.

Table 1: Regret upper bounds for stochastic combinatorial optimization under semi-bandit feedback.


2 Contribution and Related Work 

2.1 Stochastic combinatorial bandits under semi-bandit feedback 

Contribution. (a) We derive an asymptotic (as the time horizon $T$ grows large) regret lower bound
satisfied by any algorithm (Theorem 1). This lower bound is problem-specific and tight: there
exists an algorithm that attains the bound on all problem instances, although the algorithm might
be computationally expensive. To our knowledge, such lower bounds have not been proposed in
the case of stochastic combinatorial bandits. The dependency in $m$ and $d$ of the lower bound is
unfortunately not explicit. We further provide a simplified lower bound (Theorem 2) and derive its
scaling in $(m, d)$ in specific examples.

(b) We propose ESCB (Efficient Sampling for Combinatorial Bandits), an algorithm whose regret
scales at most as $O(\sqrt{m}\, d\, \Delta_{\min}^{-1} \log(T))$ (Theorem 5), where $\Delta_{\min}$ denotes the expected reward
difference between the best and the second-best arm. ESCB assigns an index to each arm. The
index of a given arm can be interpreted as performing likelihood tests with vanishing risk on its
average reward. Our indexes are the natural extension of KL-UCB indexes defined for unstructured
bandits [5]. Numerical experiments for some specific combinatorial problems are presented in the
supplementary material, and show that ESCB significantly outperforms existing algorithms.

Related work. Previous contributions on stochastic combinatorial bandits focused on specific
combinatorial structures, e.g. $m$-sets [6], matroids [7], or permutations [8]. Generic combinatorial
problems were investigated in [9, 10, 11, 12]. The proposed algorithms, LLR and CUCB, are variants
of the UCB algorithm, and their performance guarantees are presented in Table 1. Our algorithms
improve over LLR and CUCB by a multiplicative factor of $\sqrt{m}$.

2.2 Adversarial combinatorial problems under bandit feedback 

Contribution. We present algorithm CombEXP, whose regret is
$O\big(\sqrt{m^3 T (d + m^{1/2}\lambda^{-1}) \log \mu_{\min}^{-1}}\big)$, where
$\mu_{\min} = \min_{i\in[d]} \frac{1}{|\mathcal{M}|}\sum_{M\in\mathcal{M}} M_i$ and $\lambda$ is
the smallest nonzero eigenvalue of the matrix $E[MM^\top]$ when $M$ is uniformly distributed over $\mathcal{M}$
(Theorem 6). For most problems of interest $m(d\lambda)^{-1} = O(1)$ [4] and $\mu_{\min}^{-1} = O(\mathrm{poly}(d))$, so that
CombEXP has $O(\sqrt{m^3 d T \log(d/m)})$ regret. A known regret lower bound is $\Omega(m\sqrt{dT})$ [13], so
the regret gap between CombEXP and this lower bound scales at most as $m^{1/2}$ up to a logarithmic
factor.

Related work. Adversarial combinatorial bandits have been extensively investigated recently,
see [13] and references therein. Some papers consider specific instances of these problems, e.g.,
shortest-path routing [14], $m$-sets [15], and permutations [16]. For generic combinatorial problems,
known regret lower bounds scale as $\Omega(\sqrt{mdT})$ and $\Omega(m\sqrt{dT})$ (if $d \ge 2m$) in the case of
semi-bandit and bandit feedback, respectively [13]. In the case of semi-bandit feedback, [13] proposes
OSMD, an algorithm whose regret upper bound matches the lower bound. [17] presents an algorithm
with $O(m\sqrt{d L_B^\star \log(d/m)})$ regret, where $L_B^\star$ is the total reward of the best arm after $T$ rounds.

Table 2: Regret of various algorithms for adversarial combinatorial bandits with bandit feedback.
Note that for most combinatorial classes of interest, $m(d\lambda)^{-1} = O(1)$ and $\mu_{\min}^{-1} = O(\mathrm{poly}(d))$.

For problems with bandit feedback, [4] proposes ComBand and derives a regret upper bound which
depends on the structure of the action set $\mathcal{M}$. For most problems of interest, the regret under ComBand
is upper-bounded by $O(\sqrt{m^3 d T \log(d/m)})$. [18] addresses generic linear optimization with bandit
feedback and the proposed algorithm, referred to as EXP2 with John's Exploration, has a
regret scaling at most as $O(\sqrt{m^3 d T \log(d/m)})$ in the case of combinatorial structure. As we show
next, for many combinatorial structures of interest (e.g. $m$-sets, matchings, spanning trees),
CombEXP yields the same regret as ComBand and EXP2 with John's Exploration, with lower
computational complexity for a large class of problems. Table 2 summarises known regret bounds.

Example 1: m-sets. $\mathcal{M}$ is the set of all $d$-dimensional binary vectors with $m$ non-zero coordinates.
We have $\mu_{\min} = \frac{m}{d}$ and $\lambda = \frac{m(d-m)}{d(d-1)}$ (refer to the supplementary material for details). Hence when
$m = o(d)$, the regret upper bound of CombEXP becomes $O(\sqrt{m^3 d T \log(d/m)})$, which is the
same as that of ComBand and EXP2 with John's Exploration.
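The two quantities in Example 1 can be checked by brute force on a small instance. The sketch below is illustrative only: it enumerates all $m$-sets, computes $\mu_{\min}$ as the smallest marginal $\frac{1}{|\mathcal{M}|}\sum_M M_i$, builds $E[MM^\top]$ under the uniform distribution, and verifies that $\frac{m(d-m)}{d(d-1)}$ is an eigenvalue (eigenvector $e_1 - e_2$) while the all-ones vector has eigenvalue $m^2/d$. The function name `m_set_statistics` and the choice $d = 5$, $m = 2$ are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import combinations

def m_set_statistics(d, m):
    """Return (mu_min, Sigma) for the uniform distribution over all m-sets of [d]."""
    arms = [tuple(1 if i in s else 0 for i in range(d))
            for s in combinations(range(d), m)]
    n_arms = len(arms)
    # Marginal probability that basic action i belongs to a uniformly drawn arm.
    marginals = [Fraction(sum(a[i] for a in arms), n_arms) for i in range(d)]
    # Second-moment matrix E[M M^T], computed exactly with rationals.
    sigma = [[Fraction(sum(a[i] * a[j] for a in arms), n_arms) for j in range(d)]
             for i in range(d)]
    return min(marginals), sigma

def mat_vec(matrix, vec):
    return [sum(row[j] * vec[j] for j in range(len(vec))) for row in matrix]

d, m = 5, 2                                   # hypothetical small instance
mu_min, sigma = m_set_statistics(d, m)
lam = Fraction(m * (d - m), d * (d - 1))      # candidate smallest nonzero eigenvalue

# e_1 - e_2 is orthogonal to the all-ones vector; every such difference
# vector is an eigenvector with the same eigenvalue lam.
v = [1, -1] + [0] * (d - 2)
small_ok = mat_vec(sigma, v) == [lam * x for x in v]

# The all-ones vector carries the remaining eigenvalue m^2 / d >= lam.
ones = [1] * d
large_ok = mat_vec(sigma, ones) == [Fraction(m * m, d)] * d
```

Exact rational arithmetic is used so that the eigenvalue identities can be compared with `==` rather than with a floating-point tolerance.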

Example 2: matchings. The set of arms $\mathcal{M}$ is the set of perfect matchings in $K_{m,m}$, with $d = m^2$ and
$|\mathcal{M}| = m!$. We have $\mu_{\min} = \frac{1}{m}$ and $\lambda = \frac{1}{m-1}$. Hence the regret upper bound of CombEXP is
$O(\sqrt{m^5 T \log(m)})$, the same as for ComBand and EXP2 with John's Exploration.

Example 3: spanning trees. $\mathcal{M}$ is the set of spanning trees in the complete graph $K_N$. In this
case, $d = \binom{N}{2}$, $m = N - 1$, and by Cayley's formula $\mathcal{M}$ has $N^{N-2}$ arms. Moreover, $\log \mu_{\min}^{-1} \le 2N$ for
$N \ge 2$ and $\frac{m}{d\lambda} < 7$ when $N \ge 6$. The regret upper bound of ComBand and EXP2 with John's
Exploration becomes $O(\sqrt{N^5 T \log(N)})$. As for CombEXP, we get the same regret upper
bound $O(\sqrt{N^5 T \log(N)})$.


3 Models and Objectives 


We consider MAB problems where each arm $M$ is a subset of $m$ basic actions taken from $[d] =
\{1,\dots,d\}$. For $i \in [d]$, $X_i(n)$ denotes the reward of basic action $i$ in round $n$. In the stochastic
setting, for each $i$, the sequence of rewards $(X_i(n))_{n\ge 1}$ is i.i.d. with Bernoulli distribution with
mean $\theta_i$. Rewards are assumed to be independent across actions. We denote by $\theta = (\theta_1,\dots,\theta_d)^\top \in
\Theta = [0,1]^d$ the vector of unknown expected rewards of the various basic actions. In the adversarial
setting, the reward vector $X(n) = (X_1(n),\dots,X_d(n))^\top \in [0,1]^d$ is arbitrary, and the sequence
$(X(n), n \ge 1)$ is decided (but unknown) at the beginning of the experiment.

The set of arms $\mathcal{M}$ is an arbitrary subset of $\{0,1\}^d$, such that each of its elements $M$ has $m$ basic
actions. Arm $M$ is identified with a binary column vector $(M_1,\dots,M_d)^\top$, and we have $\|M\|_1 =
m$, $\forall M \in \mathcal{M}$. At the beginning of each round $n$, a policy $\pi$ selects an arm $M^\pi(n) \in \mathcal{M}$ based on
the arms chosen in previous rounds and their observed rewards. The reward of arm $M^\pi(n)$ selected
in round $n$ is $\sum_{i\in[d]} M_i^\pi(n) X_i(n) = M^\pi(n)^\top X(n)$.

We consider both semi-bandit and bandit feedback. Under semi-bandit feedback and policy $\pi$,
at the end of round $n$, the outcomes of basic actions $X_i(n)$ for all $i \in M^\pi(n)$ are revealed to the
decision maker, whereas under bandit feedback, only $M^\pi(n)^\top X(n)$ can be observed.
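The difference between the two feedback models can be made concrete with a small simulation. This is a minimal sketch of the stochastic setting, assuming Bernoulli rewards drawn with Python's `random` module; the function names and the numerical values of `theta` and `arm` are hypothetical.

```python
import random

def draw_rewards(theta, rng):
    """Draw X(n): one independent Bernoulli reward per basic action."""
    return [1 if rng.random() < p else 0 for p in theta]

def semi_bandit_feedback(arm, x):
    """Semi-bandit: X_i(n) is revealed exactly for the basic actions with M_i = 1."""
    return {i: x[i] for i, m_i in enumerate(arm) if m_i == 1}

def bandit_feedback(arm, x):
    """Bandit: only the scalar reward M^T X(n) is revealed."""
    return sum(m_i * x_i for m_i, x_i in zip(arm, x))

rng = random.Random(0)
theta = [0.9, 0.1, 0.5, 0.7]   # hypothetical mean rewards, d = 4
arm = [1, 0, 0, 1]             # an arm made of m = 2 basic actions
x = draw_rewards(theta, rng)
semi = semi_bandit_feedback(arm, x)   # per-action observations
total = bandit_feedback(arm, x)       # a single number
```

Under semi-bandit feedback the learner can update one empirical mean per selected basic action, whereas under bandit feedback it must infer the per-action rewards from the aggregate alone.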

Let $\Pi$ be the set of all feasible policies. The objective is to identify a policy in $\Pi$ maximizing the
cumulative expected reward over a finite time horizon $T$. The expectation is here taken with respect
to possible randomness in the rewards (in the stochastic setting) and the possible randomization in
the policy. Equivalently, we aim at designing a policy that minimizes regret, where the regret of
policy $\pi \in \Pi$ is defined by:

$$R^\pi(T) = \max_{M\in\mathcal{M}} E\Big[\sum_{n=1}^{T} M^\top X(n)\Big] - E\Big[\sum_{n=1}^{T} M^\pi(n)^\top X(n)\Big].$$


Finally, for the stochastic setting, we denote by $\mu_M(\theta) = M^\top\theta$ the expected reward of arm $M$,
and let $M^\star(\theta) \in \mathcal{M}$, or $M^\star$ for short, be any arm with maximum expected reward: $M^\star(\theta) \in
\arg\max_{M\in\mathcal{M}} \mu_M(\theta)$. In what follows, to simplify the presentation, we assume that the optimal
$M^\star$ is unique. We further define: $\mu^\star(\theta) = M^{\star\top}\theta$, $\Delta_{\min} = \min_{M\ne M^\star}\Delta_M$ where $\Delta_M =
\mu^\star(\theta) - \mu_M(\theta)$, and $\Delta_{\max} = \max_M (\mu^\star(\theta) - \mu_M(\theta))$.
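As a small worked instance of these definitions, the sketch below enumerates the $m$-sets of $[d]$ and computes $M^\star$, $\mu^\star(\theta)$, $\Delta_{\min}$ and $\Delta_{\max}$ by exhaustive search. The helper names and the parameter vector are assumptions chosen for illustration; exhaustive enumeration is only viable for tiny $d$.

```python
from itertools import combinations

def m_sets(d, m):
    """Enumerate the arms of the m-set structure as 0/1 tuples of length d."""
    for support in combinations(range(d), m):
        yield tuple(1 if i in support else 0 for i in range(d))

def expected_reward(arm, theta):
    """mu_M(theta) = M^T theta."""
    return sum(a * t for a, t in zip(arm, theta))

def gaps(arms, theta):
    """Return (M*, mu*, Delta_min, Delta_max), assuming the optimal arm is unique."""
    rewards = {arm: expected_reward(arm, theta) for arm in arms}
    best = max(rewards, key=rewards.get)
    mu_star = rewards[best]
    deltas = [mu_star - r for arm, r in rewards.items() if arm != best]
    return best, mu_star, min(deltas), max(deltas)

theta = [0.9, 0.8, 0.5, 0.2]                      # hypothetical means, d = 4
best, mu_star, delta_min, delta_max = gaps(list(m_sets(4, 2)), theta)
```

Here the optimal arm picks the two largest means, the second-best arm swaps the weaker of them for the third-largest mean, and the worst arm picks the two smallest means.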


4 Stochastic Combinatorial Bandits under Semi-bandit Feedback 
4.1 Regret Lower Bound 

Given $\theta$, define the set of parameters that cannot be distinguished from $\theta$ when selecting action
$M^\star(\theta)$, and for which arm $M^\star(\theta)$ is suboptimal:

$$B(\theta) = \{\lambda \in \Theta : M_i^\star(\theta)(\theta_i - \lambda_i) = 0, \ \forall i, \ \mu^\star(\lambda) > \mu^\star(\theta)\}.$$

We define $\mathcal{X} = (\mathbb{R}_+)^{|\mathcal{M}|}$ and $\mathrm{kl}(u, v)$ the Kullback-Leibler divergence between Bernoulli
distributions of respective means $u$ and $v$, i.e., $\mathrm{kl}(u, v) = u\log(u/v) + (1-u)\log((1-u)/(1-v))$.
Finally, for $(\theta, \lambda) \in \Theta^2$, we define the vector $\mathrm{kl}(\theta, \lambda) = (\mathrm{kl}(\theta_i, \lambda_i))_{i\in[d]}$.
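For concreteness, the Bernoulli divergence and its vector version can be written as follows. This is a minimal sketch; the handling of the boundary cases (the convention $0\log 0 = 0$ and an infinite value when $v$ puts zero mass where $u$ does not) is an assumption of the sketch, since the text does not discuss them.

```python
import math

def kl(u, v):
    """Bernoulli KL divergence kl(u, v) = u log(u/v) + (1-u) log((1-u)/(1-v))."""
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        raise ValueError("means must lie in [0, 1]")
    total = 0.0
    # Handle the two outcomes symmetrically, with the convention 0 log 0 = 0.
    for p, q in ((u, v), (1.0 - u, 1.0 - v)):
        if p > 0.0:
            if q == 0.0:
                return math.inf
            total += p * math.log(p / q)
    return total

def kl_vector(theta, lam):
    """Component-wise divergence (kl(theta_i, lambda_i))_{i in [d]}."""
    return [kl(t, l) for t, l in zip(theta, lam)]
```

For example, `kl(0.5, 0.25)` equals $\tfrac12\log 2 + \tfrac12\log\tfrac23 = \tfrac12\log\tfrac43$.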

We derive a regret lower bound valid for any uniformly good algorithm. An algorithm $\pi$ is uniformly
good iff $R^\pi(T) = o(T^\alpha)$ for all $\alpha > 0$ and all parameters $\theta \in \Theta$. The proof of this result relies on
a general result on controlled Markov chains [19].

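Theorem 1 For all $\theta \in \Theta$, for any uniformly good policy $\pi \in \Pi$, $\liminf_{T\to\infty} \frac{R^\pi(T)}{\log(T)} \ge c(\theta)$,
where $c(\theta)$ is the optimal value of the optimization problem:

$$\inf_{x\in\mathcal{X}} \sum_{M\in\mathcal{M}} x_M (M^\star(\theta) - M)^\top\theta \quad \text{s.t.} \quad \Big(\sum_{M\in\mathcal{M}} x_M M\Big)^\top \mathrm{kl}(\theta,\lambda) \ge 1, \ \forall\lambda \in B(\theta). \qquad (1)$$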

Observe first that optimization problem (1) is a semi-infinite linear program which can be solved for
any fixed $\theta$, but its optimal value is difficult to compute explicitly. Determining how $c(\theta)$ scales as
a function of the problem dimensions $d$ and $m$ is not obvious. Also note that (1) has the following
interpretation: assume that (1) has a unique solution $x^\star$. Then any uniformly good algorithm must
select action $M$ at least $x_M^\star \log(T)$ times over the first $T$ rounds. From [19], we know that there
exists an algorithm which is asymptotically optimal, so that its regret matches the lower bound of
Theorem 1. However, this algorithm suffers from two problems: it is computationally infeasible
for large problems since it involves solving (1) $T$ times; furthermore, the algorithm has no finite
time performance guarantees, and numerical experiments suggest that its finite time performance
on typical problems is rather poor. Further note that if $\mathcal{M}$ is the set of singletons (classical
bandit), Theorem 1 reduces to the Lai-Robbins bound [20], and if $\mathcal{M}$ is the set of $m$-sets (bandit
with multiple plays), Theorem 1 reduces to the lower bound derived in [6]. Finally, Theorem 1 can
be generalized in a straightforward manner to the case where rewards belong to a one-parameter exponential
family of distributions (e.g., Gaussian, Exponential, Gamma, etc.) by replacing kl by the appropriate
divergence measure.


A Simplified Lower Bound. We now study how the regret lower bound $c(\theta)$ scales as a function of the problem
dimensions $d$ and $m$. To this aim, we present a simplified regret lower bound. Given $\theta$, we say that
a set $\mathcal{H} \subset \mathcal{M} \setminus M^\star$ has property $P(\theta)$ iff, for all $(M, M') \in \mathcal{H}^2$, $M \ne M'$, we have
$M_i M'_i (1 - M_i^\star(\theta)) = 0$ for all $i$. We may now state Theorem 2.


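Theorem 2 Let $\mathcal{H}$ be a maximal (inclusion-wise) subset of $\mathcal{M}$ with property $P(\theta)$. Define
$\beta(\theta) = \min_{M\ne M^\star} \frac{\Delta_M}{|M\setminus M^\star|}$. Then:

$$c(\theta) \ge \sum_{M\in\mathcal{H}} \frac{\beta(\theta)}{\max_{i\in M\setminus M^\star} \mathrm{kl}\Big(\theta_i, \frac{1}{|M\setminus M^\star|}\sum_{j\in M^\star\setminus M}\theta_j\Big)}.$$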


Corollary 1 Let $\theta \in [a, 1]^d$ for some constant $a > 0$ and $\mathcal{M}$ be such that each arm $M \in \mathcal{M}$, $M \ne
M^\star$, has at most $k$ suboptimal basic actions. Then $c(\theta) = \Omega(|\mathcal{H}|/k)$.

Theorem 2 provides an explicit regret lower bound. Corollary 1 states that $c(\theta)$ scales at least
with the size of $\mathcal{H}$. For most combinatorial sets, $|\mathcal{H}|$ is proportional to $d - m$ (see supplementary
material for some examples), which implies that in these cases, one cannot obtain a regret smaller
than $O((d-m)\Delta_{\min}^{-1}\log(T))$. This result is intuitive since $d - m$ is the number of parameters
not observed when selecting the optimal arm. The algorithms proposed below have a regret of
$O(d\sqrt{m}\,\Delta_{\min}^{-1}\log(T))$, which is acceptable since typically, $\sqrt{m}$ is much smaller than $d$.


4.2 Algorithms 

Next we present ESCB, an algorithm for stochastic combinatorial bandits that relies on arm indexes
as in UCB1 [21] and KL-UCB [5]. We derive finite-time regret upper bounds for ESCB that hold
even if we assume that $\|M\|_1 \le m$, $\forall M \in \mathcal{M}$, instead of $\|M\|_1 = m$, so that arms may have
different numbers of basic actions.


4.2.1 Indexes 


ESCB relies on arm indexes. In general, an index of arm $M$ in round $n$, say $\xi_M(n)$, should be
defined so that $\xi_M(n) \ge M^\top\theta$ with high probability. Then as for UCB1 and KL-UCB, applying the
principle of optimism in the face of uncertainty, a natural way to devise algorithms based on indexes is to
select in each round the arm with the highest index. Under a given algorithm, at time $n$, we define
$t_i(n) = \sum_{s=1}^{n} M_i(s)$, the number of times basic action $i$ has been sampled. The empirical mean
reward of action $i$ is then defined as $\hat\theta_i(n) = (1/t_i(n)) \sum_{s=1}^{n} X_i(s) M_i(s)$ if $t_i(n) > 0$ and $\hat\theta_i(n) =
0$ otherwise. We define the corresponding vectors $t(n) = (t_i(n))_{i\in[d]}$ and $\hat\theta(n) = (\hat\theta_i(n))_{i\in[d]}$.

The indexes we propose are functions of the round $n$ and of $\hat\theta(n)$. Our first index for arm $M$,
referred to as $b_M(n, \hat\theta(n))$ or $b_M(n)$ for short, is an extension of the KL-UCB index. Let $f(n) =
\log(n) + 4m\log(\log(n))$. $b_M(n, \hat\theta(n))$ is the optimal value of the following optimization problem:

$$\max_{q\in\Theta} M^\top q \quad \text{s.t.} \quad (M t(n))^\top \mathrm{kl}(\hat\theta(n), q) \le f(n), \qquad (2)$$

where we use the convention that for $v, u \in \mathbb{R}^d$, $vu = (v_i u_i)_{i\in[d]}$. As we show later, $b_M(n)$ may be
computed efficiently using a line search procedure similar to that used to determine the KL-UCB index.

Our second index $c_M(n, \hat\theta(n))$, or $c_M(n)$ for short, is a generalization of the UCB1 and UCB-tuned
indexes:

$$c_M(n) = M^\top\hat\theta(n) + \sqrt{\frac{f(n)}{2}\sum_{i=1}^{d}\frac{M_i}{t_i(n)}}.$$

Note that, in the classical bandit problems with independent arms, i.e., when $m = 1$, $b_M(n)$
reduces to the KL-UCB index (which yields an asymptotically optimal algorithm) and $c_M(n)$ reduces
to the UCB-tuned index. The next theorem provides generic properties of our indexes. An important
consequence of these properties is that the expected number of times where $b_{M^\star}(n, \hat\theta(n))$ or
$c_{M^\star}(n, \hat\theta(n))$ underestimate $\mu^\star(\theta)$ is finite, as stated in the corollary below.

Theorem 3 (i) For all $n \ge 1$, $M \in \mathcal{M}$ and $\tau \in [0,1]^d$, we have $b_M(n, \tau) \le c_M(n, \tau)$.

(ii) There exists $C_m > 0$ depending on $m$ only such that, for all $M \in \mathcal{M}$ and $n \ge 2$:

$$P[b_M(n, \hat\theta(n)) < M^\top\theta] \le C_m n^{-1}(\log(n))^{-2}.$$

Corollary 2 $\sum_{n\ge 1} P[b_{M^\star}(n, \hat\theta(n)) < \mu^\star] \le 1 + C_m \sum_{n\ge 2} n^{-1}(\log(n))^{-2} < \infty$.

Statement (i) in the above theorem is obtained by combining the Pinsker and Cauchy-Schwarz inequalities.
The proof of statement (ii) is based on a concentration inequality on sums of empirical KL
divergences proven in [22]. It makes it possible to control the fluctuations of multivariate empirical distributions
for exponential families. It should also be observed that indexes $b_M(n)$ and $c_M(n)$ can be extended
in a straightforward manner to the case of continuous linear bandit problems, where the set of arms
is the unit sphere and one wants to maximize the dot product between the arm and an unknown
vector. $b_M(n)$ can also be extended to the case where reward distributions are not Bernoulli but
lie in an exponential family (e.g. Gaussian, Exponential, Gamma, etc.), replacing kl by a suitably
chosen divergence measure. A close look at $c_M(n)$ reveals that the indexes proposed in [10], [11],
and [9] are too conservative to be optimal in our setting: there the "confidence bonus" $\sum_{i=1}^{d} \frac{M_i}{t_i(n)}$
was replaced by (at least) $m\sum_{i=1}^{d} \frac{M_i}{t_i(n)}$. Note that [10], [11] assume that the various basic actions
are arbitrarily correlated, while we assume independence among basic actions. When independence
does not hold, [11] provides a problem instance where the regret is at least $\Omega(\frac{md}{\Delta_{\min}}\log(T))$. This
does not contradict our regret upper bound (scaling as $O(\frac{d\sqrt{m}}{\Delta_{\min}}\log(T))$), since we have added the
independence assumption.
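The Pinsker and Cauchy-Schwarz argument behind statement (i) can be sketched as follows. This is a reconstruction under the standard form of Pinsker's inequality for Bernoulli distributions, $\mathrm{kl}(p, q) \ge 2(p-q)^2$, and is not a substitute for the full proof. Write $t_i$ for $t_i(n)$ and $\hat\theta_i$ for $\hat\theta_i(n)$, and let $q$ be any point satisfying the constraint $\sum_i M_i t_i\, \mathrm{kl}(\hat\theta_i, q_i) \le f(n)$. Using $M_i \in \{0,1\}$:

$$
\begin{aligned}
M^\top q - M^\top\hat\theta
 &= \sum_{i=1}^{d} \frac{M_i}{\sqrt{t_i}} \cdot \sqrt{t_i}\, M_i (q_i - \hat\theta_i) \\
 &\le \sqrt{\sum_{i=1}^{d} \frac{M_i}{t_i}} \; \sqrt{\sum_{i=1}^{d} M_i t_i (q_i - \hat\theta_i)^2}
   && \text{(Cauchy-Schwarz)} \\
 &\le \sqrt{\sum_{i=1}^{d} \frac{M_i}{t_i}} \; \sqrt{\frac{1}{2}\sum_{i=1}^{d} M_i t_i\, \mathrm{kl}(\hat\theta_i, q_i)}
   && \text{(Pinsker)} \\
 &\le \sqrt{\frac{f(n)}{2} \sum_{i=1}^{d} \frac{M_i}{t_i}} .
\end{aligned}
$$

Taking the maximum over feasible $q$ on the left-hand side shows that the optimal value of the KL-constrained problem is at most $M^\top\hat\theta + \sqrt{\frac{f(n)}{2}\sum_i \frac{M_i}{t_i}}$, which is the UCB-type index.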

4.2.2 Index computation 

While the index $c_M(n)$ is explicit, $b_M(n)$ is defined as the solution to an optimization problem. We
show that it may be computed by a simple line search. For $\lambda > 0$, $w \in [0,1]$ and $v \in \mathbb{N}$, define:

$$g(\lambda, w, v) = \Big(1 - \lambda v + \sqrt{(1 - \lambda v)^2 + 4wv\lambda}\Big)/2.$$

Fix $n$, $M$, $\hat\theta(n)$ and $t(n)$. Define $I = \{i : M_i = 1, \hat\theta_i(n) \ne 1\}$, and for $\lambda > 0$, define:

$$F(\lambda) = \sum_{i\in I} t_i(n)\, \mathrm{kl}\big(\hat\theta_i(n), g(\lambda, \hat\theta_i(n), t_i(n))\big).$$

Theorem 4 If $I = \emptyset$, $b_M(n) = \|M\|_1$. Otherwise: (i) $\lambda \mapsto F(\lambda)$ is strictly decreasing (since $g(\lambda, w, v)$
decreases from $1$ to $w$ as $\lambda$ grows), and $F(\mathbb{R}^+) = \mathbb{R}^+$. (ii) Define $\lambda^\star$ as the unique solution to
$F(\lambda) = f(n)$. Then $b_M(n) = \|M\|_1 - |I| + \sum_{i\in I} g(\lambda^\star, \hat\theta_i(n), t_i(n))$.

Theorem 4 shows that $b_M(n)$ can be computed using a line search procedure such as bisection,
as this computation amounts to solving the nonlinear equation $F(\lambda) = f(n)$, where $F$ is strictly
monotone. The proof of Theorem 4 follows from KKT conditions and the convexity of KL
divergence.
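The line search can be sketched in a few lines of Python. This is an illustrative implementation under several assumptions that are not part of the text: every basic action of the arm has been sampled at least once ($t_i \ge 1$), the level $f(n)$ is positive (otherwise the sketch falls back to $M^\top\hat\theta(n)$), and $F(\lambda) = \sum_{i\in I} t_i\,\mathrm{kl}(\hat\theta_i, g(\lambda,\hat\theta_i,t_i))$, which decreases in $\lambda$ because $g$ decreases from $1$ to $\hat\theta_i$. The names `index_b` and `index_c` are hypothetical; `index_c` computes the explicit UCB-type index $M^\top\hat\theta + \sqrt{\tfrac{f(n)}{2}\sum_i M_i/t_i}$ for comparison.

```python
import math

def kl(u, v, eps=1e-15):
    """Bernoulli KL divergence; v is clipped away from {0, 1} for stability."""
    v = min(max(v, eps), 1.0 - eps)
    out = 0.0
    if u > 0.0:
        out += u * math.log(u / v)
    if u < 1.0:
        out += (1.0 - u) * math.log((1.0 - u) / (1.0 - v))
    return out

def g(lam, w, v):
    """g(lam, w, v) = (1 - lam v + sqrt((1 - lam v)^2 + 4 w v lam)) / 2."""
    a = 1.0 - lam * v
    c = 4.0 * w * v * lam
    s = math.sqrt(a * a + c)
    # For a < 0 use the algebraically equal form c / (2 (s - a)) to avoid cancellation.
    return (a + s) / 2.0 if a >= 0.0 else c / (2.0 * (s - a))

def index_b(arm, theta_hat, t, f_n, iters=200):
    """KL-based index b_M(n) via bisection on F(lam) = f_n (assumes t_i >= 1 on the arm)."""
    active = [i for i, m_i in enumerate(arm) if m_i == 1 and theta_hat[i] < 1.0]
    if not active:
        return float(sum(arm))
    if f_n <= 0.0:
        # Assumption of this sketch: with no exploration budget, return M^T theta_hat.
        return float(sum(theta_hat[i] for i, m_i in enumerate(arm) if m_i == 1))

    def big_f(lam):
        return sum(t[i] * kl(theta_hat[i], g(lam, theta_hat[i], t[i])) for i in active)

    lo, hi = 0.0, 1.0
    while big_f(hi) > f_n:          # F decreases to 0, so this loop terminates
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if big_f(mid) > f_n:
            lo = mid
        else:
            hi = mid                # keep hi on the feasible side F(hi) <= f_n
    return sum(arm) - len(active) + sum(g(hi, theta_hat[i], t[i]) for i in active)

def index_c(arm, theta_hat, t, f_n):
    """Explicit index: M^T theta_hat + sqrt((f_n / 2) * sum_i M_i / t_i)."""
    mean = sum(th for m_i, th in zip(arm, theta_hat) if m_i == 1)
    bonus = math.sqrt(0.5 * f_n * sum(1.0 / t_i for m_i, t_i in zip(arm, t) if m_i == 1))
    return mean + bonus
```

With a single basic action the routine reduces to the usual KL-UCB computation: the returned value $q$ satisfies $t\,\mathrm{kl}(\hat\theta, q) = f(n)$ up to numerical precision, and it never exceeds the explicit index.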

4.2.3 The ESCB Algorithm 

The pseudo-code of ESCB is presented in Algorithm 1. We consider two variants of the algorithm
based on the choice of the index $\xi_M(n)$: ESCB-1 when $\xi_M(n) = b_M(n)$ and ESCB-2 if $\xi_M(n) =
c_M(n)$. In practice, ESCB-1 outperforms ESCB-2. Introducing ESCB-2 is however instrumental
in the regret analysis of ESCB-1 (in view of Theorem 3 (i)). The following theorem provides a
finite time analysis of our ESCB algorithms. The proof of this theorem borrows some ideas from
the proof of [11, Theorem 3].

Theorem 5 The regret under algorithms $\pi \in \{\text{ESCB-1}, \text{ESCB-2}\}$ satisfies for all $T \ge 1$:

$$R^\pi(T) \le 16 d\sqrt{m}\,\Delta_{\min}^{-1} f(T) + 4dm^3\Delta_{\min}^{-2} + C'_m,$$

where $C'_m \ge 0$ does not depend on $\theta$, $d$ and $T$. As a consequence $R^\pi(T) = O(d\sqrt{m}\,\Delta_{\min}^{-1}\log(T))$
when $T \to \infty$.

Algorithm 1 ESCB
for $n \ge 1$ do
    Select arm $M(n) \in \arg\max_{M\in\mathcal{M}} \xi_M(n)$.
    Observe the rewards, and update $t_i(n)$ and $\hat\theta_i(n)$, $\forall i \in M(n)$.
end for

Algorithm 2 CombEXP
Initialization: Set $q_0 = \mu^0$, $\gamma = \frac{\sqrt{m\log\mu_{\min}^{-1}}}{\sqrt{m\log\mu_{\min}^{-1}} + \sqrt{C(Cm^2 d + m)T}}$ and $\eta = \gamma C$, with $C = \frac{\lambda}{m^{3/2}}$.
for $n \ge 1$ do
    Mixing: Let $q'_{n-1} = (1-\gamma) q_{n-1} + \gamma\mu^0$.
    Decomposition: Select a distribution $p_{n-1}$ over $\mathcal{M}$ such that $\sum_M p_{n-1}(M) M = m q'_{n-1}$.
    Sampling: Select a random arm $M(n)$ with distribution $p_{n-1}$ and incur a reward $Y_n = \sum_i X_i(n) M_i(n)$.
    Estimation: Let $\Sigma_{n-1} = E[MM^\top]$, where $M$ has law $p_{n-1}$. Set $\tilde X(n) = Y_n \Sigma_{n-1}^{+} M(n)$, where $\Sigma_{n-1}^{+}$
    is the pseudo-inverse of $\Sigma_{n-1}$.
    Update: Set $\tilde q_n(i) \propto q_{n-1}(i)\exp(\eta\tilde X_i(n))$, $\forall i \in [d]$.
    Projection: Set $q_n$ to be the projection of $\tilde q_n$ onto the set $\mathcal{P}$ using the KL divergence.
end for


ESCB with time horizon $T$ has a complexity of $O(|\mathcal{M}|T)$ as neither $b_M$ nor $c_M$ can be written
as $M^\top y$ for some vector $y \in \mathbb{R}^d$. Assuming that the offline (static) combinatorial problem is
solvable in $O(V(\mathcal{M}))$ time, the complexity of the CUCB algorithm in [10] and [11] after $T$ rounds is
$O(V(\mathcal{M})T)$. Thus, if the offline problem is efficiently implementable, i.e., $V(\mathcal{M}) = O(\mathrm{poly}(d))$,
CUCB is efficient, whereas ESCB is not since $|\mathcal{M}|$ may have exponentially many elements. In §2.5
of the supplement, we provide an extension of ESCB called Epoch-ESCB, that attains almost the
same regret as ESCB while enjoying much better computational complexity.
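The $O(|\mathcal{M}|T)$ cost is visible in a direct implementation: each round scans every arm to find the largest index. The following is a minimal sketch of the ESCB-2 variant on $m$-sets, not a reference implementation. Its assumptions are labeled in the comments: an initialization phase plays arms until every basic action has been sampled once, $f(n)$ is floored at $0$ for small $n$, and rewards are simulated Bernoulli draws with hypothetical means.

```python
import math
import random
from itertools import combinations

def m_sets(d, m):
    """All arms of the m-set structure, as 0/1 tuples."""
    return [tuple(1 if i in s else 0 for i in range(d))
            for s in combinations(range(d), m)]

def f(n, m):
    """f(n) = log(n) + 4 m log(log(n)); floored at 0 for small n (assumption)."""
    if n < 3:
        return 0.0
    return max(0.0, math.log(n) + 4.0 * m * math.log(math.log(n)))

def escb2(arms, theta, horizon, seed=0):
    """Run the UCB-type variant of ESCB with a brute-force argmax over all arms."""
    rng = random.Random(seed)
    d = len(theta)
    m = max(sum(a) for a in arms)
    t = [0] * d            # t_i: number of samples of basic action i
    s = [0.0] * d          # cumulative reward of basic action i
    pulls = {a: 0 for a in arms}
    for n in range(1, horizon + 1):
        unseen = [a for a in arms if any(a[i] and t[i] == 0 for i in range(d))]
        if unseen:
            arm = unseen[0]                      # initialization phase (assumption)
        else:
            f_n = f(n, m)

            def c_index(a):
                mean = sum(s[i] / t[i] for i in range(d) if a[i])
                bonus = math.sqrt(0.5 * f_n * sum(1.0 / t[i] for i in range(d) if a[i]))
                return mean + bonus

            arm = max(arms, key=c_index)         # O(|M|) scan per round
        for i in range(d):
            if arm[i]:                           # semi-bandit feedback on the arm
                t[i] += 1
                s[i] += 1.0 if rng.random() < theta[i] else 0.0
        pulls[arm] += 1
    return pulls

theta = [0.9, 0.8, 0.2, 0.1]                     # hypothetical means, d = 4, m = 2
pulls = escb2(m_sets(4, 2), theta, horizon=2000)
```

On this toy instance the scan covers only six arms, but the same loop over matchings or spanning trees would be exponential in $d$, which is the motivation for an epoch-based variant.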

5 Adversarial Combinatorial Bandits under Bandit Feedback 

We now consider adversarial combinatorial bandits with bandit feedback. We start with the following
observation:

$$\max_{M\in\mathcal{M}} M^\top X = \max_{\mu\in Co(\mathcal{M})} \mu^\top X,$$

with $Co(\mathcal{M})$ the convex hull of $\mathcal{M}$. We embed $\mathcal{M}$ in the $d$-dimensional simplex by dividing its
elements by $m$. Let $\mathcal{P}$ be this scaled version of $Co(\mathcal{M})$.

Inspired by OSMD [13, 18], we propose the CombEXP algorithm, where the KL divergence
is the Bregman divergence used to project onto $\mathcal{P}$. Projection using the KL divergence is
addressed in [23]. We denote the KL divergence between distributions $q$ and $p$ in $\mathcal{P}$ by
$KL(p, q) = \sum_{i\in[d]} p(i)\log\frac{p(i)}{q(i)}$. The projection of distribution $q$ onto a closed convex set $\Xi$ of
distributions is $p^\star = \arg\min_{p\in\Xi} KL(p, q)$.

Let $\lambda$ be the smallest nonzero eigenvalue of $E[MM^\top]$, where $M$ is uniformly distributed over $\mathcal{M}$.
We define the exploration-inducing distribution $\mu^0 \in \mathcal{P}$: $\mu_i^0 = \frac{1}{m|\mathcal{M}|}\sum_{M\in\mathcal{M}} M_i$, $\forall i \in [d]$, and
let $\mu_{\min} = \min_i m\mu_i^0$. $\mu^0$ is the distribution over basic actions $[d]$ induced by the uniform
distribution over $\mathcal{M}$. The pseudo-code for CombEXP is shown in Algorithm 2. The KL projection
in CombEXP ensures that $mq_{n-1} \in Co(\mathcal{M})$. There exists $\lambda$, a distribution over $\mathcal{M}$, such that
$mq_{n-1} = \sum_M \lambda(M) M$. This guarantees that the system of linear equations in the decomposition
step is consistent. We propose to perform the projection step (the KL projection of $\tilde q$ onto $\mathcal{P}$) using
interior-point methods [24]. We provide a simpler method in §3.4 of the supplement. The
decomposition step can be efficiently implemented using the algorithm of [25]. The following theorem
provides a regret upper bound for CombEXP.
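For one structure the KL projection has a simple form that helps fix ideas. The sketch below is an illustration for $m$-sets only and is not the interior-point method or the supplement's method referred to above. It relies on two facts stated here as assumptions of the sketch: for $m$-sets, $\mathcal{P} = \{p : \sum_i p_i = 1,\ 0 \le p_i \le 1/m\}$, and the KKT conditions of $\min_p KL(p, q)$ over this set give $p_i = \min(1/m, \alpha q_i)$ for a scaling $\alpha$ chosen so that $p$ sums to one. The function name is hypothetical and $q$ is assumed to have strictly positive entries.

```python
def kl_project_capped_simplex(q, cap):
    """KL projection of a positive vector q onto {p : sum p = 1, 0 <= p_i <= cap}."""
    d = len(q)
    if cap * d < 1.0 - 1e-12:
        raise ValueError("infeasible: cap * d must be at least 1")
    order = sorted(range(d), key=lambda i: q[i], reverse=True)
    p = [0.0] * d
    rest = sum(q)                      # mass of q on the coordinates not yet capped
    for k in range(d):                 # k = number of coordinates capped so far
        alpha = (1.0 - k * cap) / rest
        if alpha * q[order[k]] <= cap + 1e-15:
            # The largest remaining coordinate fits under the cap: rescale the rest.
            for i in order[k:]:
                p[i] = alpha * q[i]
            return p
        p[order[k]] = cap              # otherwise cap it and continue
        rest -= q[order[k]]
    return p

# Hypothetical example with d = 3 and m = 2, so the cap is 1/m = 0.5.
q_tilde = [0.7, 0.2, 0.1]
q_proj = kl_project_capped_simplex(q_tilde, cap=0.5)
```

In the example the first coordinate is capped at $1/2$ and the remaining mass $1/2$ is split proportionally to $(0.2, 0.1)$, giving $(1/2, 1/3, 1/6)$. For structures such as matchings or spanning trees, $\mathcal{P}$ has more constraints and a generic convex solver is needed.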


Theorem 6 For all $T \ge 1$:

$$R^{\text{CombEXP}}(T) \le 2\sqrt{m^3 T\Big(d + \frac{m^{1/2}}{\lambda}\Big)\log\mu_{\min}^{-1}} + \frac{m^{5/2}}{\lambda}\log\mu_{\min}^{-1}.$$

For most classes of $\mathcal{M}$, we have $\mu_{\min}^{-1} = O(\mathrm{poly}(d))$ and $m(d\lambda)^{-1} = O(1)$ [4]. For these classes,
CombEXP has a regret of $O(\sqrt{m^3 d T\log(d/m)})$, which is a factor $\sqrt{m\log(d/m)}$ off the lower
bound (see Table 2).

It might not be possible to compute the projection step exactly, and this step can be solved up
to accuracy $\epsilon_n$ in round $n$. Namely, we find $q_n$ such that $KL(q_n, \tilde q_n) - \min_{p\in\Xi} KL(p, \tilde q_n) \le \epsilon_n$.
Proposition 1 shows that for $\epsilon_n = O(n^{-2}\log^{-3}(n))$, the approximate projection gives the same
regret as when the projection is computed exactly. Theorem 7 gives the computational complexity of
CombEXP with approximate projection. When $Co(\mathcal{M})$ is described by polynomially (in $d$) many
linear equalities/inequalities, CombEXP is efficiently implementable and its running time scales
(almost) linearly in $T$. Proposition 1 and Theorem 7 easily extend to other OSMD-type algorithms
and thus might be of independent interest.

Proposition 1 If the projection step of CombEXP is solved up to accuracy
$\epsilon_n = O(n^{-2}\log^{-3}(n))$, we have:

$$R^{\text{CombEXP}}(T) \le 2\sqrt{2m^3 T\Big(d + \frac{m^{1/2}}{\lambda}\Big)\log\mu_{\min}^{-1}} + \frac{2m^{5/2}}{\lambda}\log\mu_{\min}^{-1}.$$

Theorem 7 Assume that $Co(\mathcal{M})$ is defined by $c$ linear equalities and $s$ linear inequalities. If the
projection step is solved up to accuracy $\epsilon_n = O(n^{-2}\log^{-3}(n))$, then CombEXP has time
complexity $O(T[\sqrt{s}(c+d)^3\log(T) + d^4])$.

The time complexity of CombEXP can be reduced by exploiting the structure of $\mathcal{M}$ (see [24,
page 545]). In particular, if inequality constraints describing $Co(\mathcal{M})$ are box constraints, the time
complexity of CombEXP is $O(T[c^2\sqrt{s}(c+d)\log(T) + d^4])$.

The computational complexity of CombEXP is determined by the structure of $Co(\mathcal{M})$, and CombEXP
has $O(T\log(T))$ time complexity due to the efficiency of interior-point methods. In contrast,
the computational complexity of ComBand depends on the complexity of sampling from $\mathcal{M}$.
ComBand may have a time complexity that is super-linear in $T$ (see [4, page 217]). For instance,
consider the matching problem described in Section 2. We have $c = 2m$ equality constraints and
$s = m^2$ box constraints, so that the time complexity of CombEXP is $O(m^5 T\log(T))$. Note
that using [26, Algorithm 1], the cost of decomposition in this case is $O(m^4)$. On the other hand,
ComBand has a time complexity of $O(m^{10} F(T))$, with $F$ a super-linear function, as it requires
approximating a permanent, which takes $O(m^{10})$ operations per round. Thus, CombEXP has much
lower complexity than ComBand and achieves the same regret.

6 Conclusion 

We have investigated stochastic and adversarial combinatorial bandits. For stochastic combinatorial
bandits with semi-bandit feedback, we have provided a tight, problem-dependent regret lower bound
that, in most cases, scales at least as $\Omega((d-m)\Delta_{\min}^{-1}\log(T))$. We proposed ESCB, an algorithm
with $O(d\sqrt{m}\,\Delta_{\min}^{-1}\log(T))$ regret. We plan to reduce the gap between this regret guarantee and
the regret lower bound, as well as investigate the performance of Epoch-ESCB. For adversarial
combinatorial bandits with bandit feedback, we proposed the CombEXP algorithm. There is a gap
between the regret of CombEXP and the known regret lower bound in this setting, and we plan to
reduce it as much as possible.

Supplementary Materials and Proofs


A Stochastic Combinatorial Bandits: Regret Lower Bounds 

A.1 Proof of Theorem 1

To derive regret lower bounds, we apply the techniques used by Graves and Lai to investigate
efficient adaptive decision rules in controlled Markov chains. First we give an overview of their
general framework.

Consider a controlled Markov chain $(X_n)_{n\ge 0}$ on a finite state space $S$ with a control set $U$. The
transition probabilities given control $u \in U$ are parameterized by $\theta$ taking values in a compact
metric space $\Theta$: the probability to move from state $x$ to state $y$ given the control $u$ and the parameter
$\theta$ is $p(x, y; u, \theta)$. The parameter $\theta$ is not known. The decision maker is provided with a finite set
of stationary control laws $G = \{g_1, \dots, g_K\}$, where each control law $g_j$ is a mapping from $S$ to
$U$: when control law $g_j$ is applied in state $x$, the applied control is $u = g_j(x)$. It is assumed that if
the decision maker always selects the same control law $g$, the Markov chain is then irreducible with
stationary distribution $\pi_\theta^g$. Now the reward obtained when applying control $u$ in state $x$ is denoted by
$r(x, u)$, so that the expected reward achieved under control law $g$ is: $\mu_\theta(g) = \sum_x r(x, g(x))\pi_\theta^g(x)$.
There is an optimal control law given $\theta$ whose expected reward is denoted by $\mu_\theta^\star = \max_{g\in G}\mu_\theta(g)$.
Now the objective of the decision maker is to sequentially select control laws so as to maximize
the expected reward up to a given time horizon $T$. As for MAB problems, the performance of a
decision scheme can be quantified through the notion of regret, which compares the expected reward
to that obtained by always applying the optimal control law.


Proof. The parameter $\theta$ takes values in $[0,1]^d$. The Markov chain has values in $S = \{0,1\}^d$. The set
of controls corresponds to the set of feasible actions $\mathcal{M}$, and the set of control laws is also $\mathcal{M}$. These
laws are constant, in the sense that the control applied by control law $M \in \mathcal{M}$ does not depend on
the state of the Markov chain, and corresponds to selecting action $M$. The transition probabilities
are given as follows: for all $x, y \in S$,

$$p(x, y; M, \theta) = p(y; M, \theta) = \prod_{i\in[d]} p_i(y_i; M, \theta),$$

where for all $i \in [d]$, if $M_i = 0$, $p_i(0; M, \theta) = 1$, and if $M_i = 1$, $p_i(y_i; M, \theta) = \theta_i^{y_i}(1-\theta_i)^{1-y_i}$.
Finally, the reward $r(y, M)$ is defined by $r(y, M) = M^\top y$. Note that the state space of the Markov
chain is here finite, and so, we do not need to impose any cost associated with switching control
laws (see the discussion on page 718 in [19]).

We can now apply Theorem 1 in [19]. Note that the KL number under action $M$ is

$$\mathrm{kl}^M(\theta, \lambda) = \sum_{i\in[d]} M_i\,\mathrm{kl}(\theta_i, \lambda_i).$$

From [19, Theorem 1], we conclude that for any uniformly good rule $\pi$,

$$\liminf_{T\to\infty} \frac{R^\pi(T)}{\log(T)} \ge c(\theta),$$

where $c(\theta)$ is the optimal value of the following optimization problem:

$$\inf_{x_M \ge 0,\, M\in\mathcal{M}} \ \sum_{M\ne M^\star} x_M(\mu^\star - \mu_M(\theta)), \qquad (3)$$

$$\text{s.t.} \quad \inf_{\lambda\in B(\theta)} \sum_{Q\ne M^\star} x_Q\,\mathrm{kl}^Q(\theta, \lambda) \ge 1. \qquad (4)$$

The result is obtained by observing that $B(\theta) = \bigcup_{M\ne M^\star} B_M(\theta)$, where

$$B_M(\theta) = \{\lambda \in \Theta : M_i^\star(\theta)(\theta_i - \lambda_i) = 0, \ \forall i, \ \mu^\star(\theta) < \mu_M(\lambda)\}.$$

□

A.2 Proof of Theorem 2


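The proof proceeds in three steps. In the subsequent analysis, given the optimization problem P, we
use val(P) to denote its optimal value.

Step 1. In this step, first we introduce an equivalent formulation for problem (3) above by
simplifying its constraints. We show that constraint (4) is equivalent to:

$$\inf_{\lambda\in B_M(\theta)} \sum_{i\in M\setminus M^\star} \mathrm{kl}(\theta_i, \lambda_i) \sum_{Q\in\mathcal{M}} Q_i x_Q \ge 1, \quad \forall M \ne M^\star.$$

Observe that:

$$\sum_{Q\ne M^\star} x_Q\,\mathrm{kl}^Q(\theta, \lambda) = \sum_{Q\ne M^\star} x_Q \sum_{i\in[d]} Q_i\,\mathrm{kl}(\theta_i, \lambda_i) = \sum_{i\in[d]} \mathrm{kl}(\theta_i, \lambda_i) \sum_{Q\ne M^\star} Q_i x_Q.$$

Fix $M \ne M^\star$. In view of the definition of $B_M(\theta)$, we can find $\lambda \in B_M(\theta)$ such that $\lambda_i = \theta_i$, $\forall i \in
([d]\setminus M) \cup M^\star$. Thus, for the r.h.s. of the $M$-th constraint in (4), we get:

$$\inf_{\lambda\in B_M(\theta)} \sum_{Q\ne M^\star} x_Q\,\mathrm{kl}^Q(\theta, \lambda) = \inf_{\lambda\in B_M(\theta)} \sum_{i\in[d]} \mathrm{kl}(\theta_i, \lambda_i) \sum_{Q\ne M^\star} Q_i x_Q = \inf_{\lambda\in B_M(\theta)} \sum_{i\in M\setminus M^\star} \mathrm{kl}(\theta_i, \lambda_i) \sum_{Q} Q_i x_Q,$$

and therefore problem (3) can be equivalently written as:

$$c(\theta) = \inf_{x\ge 0} \sum_{M\ne M^\star} \Delta_M x_M, \qquad (5)$$

$$\text{s.t.} \quad \inf_{\lambda\in B_M(\theta)} \sum_{i\in M\setminus M^\star} \mathrm{kl}(\theta_i, \lambda_i) \sum_{Q} Q_i x_Q \ge 1, \quad \forall M \ne M^\star. \qquad (6)$$

Next, we formulate an LP whose value gives a lower bound for $c(\theta)$. Define $\lambda(M) = (\lambda_i(M), i \in
[d])$ with

$$\lambda_i(M) = \begin{cases} \frac{1}{|M\setminus M^\star|}\sum_{j\in M^\star\setminus M}\theta_j & \text{if } i \in M\setminus M^\star, \\ \theta_i & \text{otherwise.} \end{cases}$$

Clearly $\lambda(M) \in B_M(\theta)$, and therefore:

$$\inf_{\lambda\in B_M(\theta)} \sum_{i\in M\setminus M^\star} \mathrm{kl}(\theta_i, \lambda_i) \sum_{Q} Q_i x_Q \le \sum_{i\in M\setminus M^\star} \mathrm{kl}(\theta_i, \lambda_i(M)) \sum_{Q} Q_i x_Q.$$

Then, we can write:

$$c(\theta) \ge \inf_{x\ge 0} \sum_{M\ne M^\star} \Delta_M x_M \qquad (7)$$

$$\text{s.t.} \quad \sum_{i\in M\setminus M^\star} \mathrm{kl}(\theta_i, \lambda_i(M)) \sum_{Q} Q_i x_Q \ge 1, \quad \forall M \ne M^\star. \qquad (8)$$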

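For any $M \neq M^*$ introduce: $g_M = \max_{i \in M \setminus M^*} \mathrm{kl}(\theta_i, \lambda_i(M))$. Now we form P1 as follows:

\begin{align}
\text{P1:} \quad \inf_{x \ge 0} \quad & \sum_{M \neq M^*} \Delta_M x_M \tag{9} \\
\text{s.t.} \quad & \sum_{i \in M \setminus M^*} \sum_{Q} Q_i x_Q \ge \frac{1}{g_M}, \quad \forall M \neq M^*. \tag{10}
\end{align}

Observe that $c(\theta) \ge \mathrm{val}(\text{P1})$ since the feasible set of problem (7) is contained in that of P1.

Step 2. In this step, we formulate an LP to give a lower bound for val(P1). To this end, for any
suboptimal basic action $i \in [d]$, we define $z_i = \sum_{M} M_i x_M$. Further, we let $z = [z_i, i \in [d]]$. Next,
we represent the objective of P1 in terms of $z$, and give a lower bound for it as follows:

\begin{align*}
\sum_{M \neq M^*} \Delta_M x_M
&= \sum_{M \neq M^*} x_M \sum_{i \in M \setminus M^*} \frac{\Delta_M}{|M \setminus M^*|} \\
&= \sum_{M \neq M^*} x_M \sum_{i \in [d] \setminus M^*} \frac{\Delta_M}{|M \setminus M^*|} M_i \\
&\ge \min_{M' \neq M^*} \frac{\Delta_{M'}}{|M' \setminus M^*|} \cdot \sum_{i \in [d] \setminus M^*} \sum_{M \neq M^*} M_i x_M \\
&= \beta(\theta) \sum_{i \in [d] \setminus M^*} z_i.
\end{align*}

Then, defining

\begin{align*}
\text{P2:} \quad \inf_{z \ge 0} \quad & \beta(\theta) \sum_{i \in [d] \setminus M^*} z_i \\
\text{s.t.} \quad & \sum_{i \in M \setminus M^*} z_i \ge \frac{1}{g_M}, \quad \forall M \neq M^*,
\end{align*}

yields: $\mathrm{val}(\text{P1}) \ge \mathrm{val}(\text{P2})$.

Step 3. Introduce set $\mathcal{H}$ satisfying property $P(\theta)$ as stated in Section 4. Now define

\[ \mathcal{Z} = \Big\{ z \in \mathbb{R}_+^d : \sum_{i \in M \setminus M^*} z_i \ge \frac{1}{g_M}, \ \forall M \in \mathcal{H} \Big\}, \]

and

\[ \text{P3:} \quad \inf_{z \in \mathcal{Z}} \ \beta(\theta) \sum_{i \in [d] \setminus M^*} z_i. \]

Observe that $\mathrm{val}(\text{P2}) \ge \mathrm{val}(\text{P3})$ since the feasible set of P2 is contained in $\mathcal{Z}$. The definition of $\mathcal{H}$
implies that $\sum_{i \in [d] \setminus M^*} z_i = \sum_{M \in \mathcal{H}} \sum_{i \in M \setminus M^*} z_i$. It then follows that

\begin{align*}
\mathrm{val}(\text{P3}) = \sum_{M \in \mathcal{H}} \frac{\beta(\theta)}{g_M}
&\ge \sum_{M \in \mathcal{H}} \frac{\beta(\theta)}{\max_{i \in M \setminus M^*} \mathrm{kl}(\theta_i, \lambda_i(M))} \\
&= \sum_{M \in \mathcal{H}} \frac{\beta(\theta)}{\max_{i \in M \setminus M^*} \mathrm{kl}\Big(\theta_i, \frac{1}{|M \setminus M^*|} \sum_{j \in M^* \setminus M} \theta_j\Big)}.
\end{align*}

The proof is completed by observing that: $c(\theta) \ge \mathrm{val}(\text{P1}) \ge \mathrm{val}(\text{P2}) \ge \mathrm{val}(\text{P3})$.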


□ 
Figure 1: Matchings in $\mathcal{K}_{4,4}$: (a) The optimal matching $M^*$, (b)-(g) Elements of $\mathcal{H}$.


A.3 Proof of Corollary 1 

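Fix $M \neq M^*$. For any $i \in M \setminus M^*$, we have:

\begin{align*}
\mathrm{kl}\Big(\theta_i, \frac{1}{|M \setminus M^*|} \sum_{j \in M^* \setminus M} \theta_j\Big)
&\le \frac{1}{|M \setminus M^*|} \sum_{j \in M^* \setminus M} \mathrm{kl}(\theta_i, \theta_j) \quad \text{(by convexity of } \mathrm{kl}(\theta_i, \cdot)\text{)} \\
&\le \frac{1}{|M \setminus M^*|} \sum_{j \in M^* \setminus M} \frac{(\theta_i - \theta_j)^2}{\theta_j (1 - \theta_j)} \\
&\le \frac{1}{|M \setminus M^*|} \sum_{j \in M^* \setminus M} \frac{(1 - \theta_j)^2}{\theta_j (1 - \theta_j)} \\
&= \frac{1}{|M \setminus M^*|} \sum_{j \in M^* \setminus M} \Big(\frac{1}{\theta_j} - 1\Big) \\
&\le \frac{1}{\min_{j \in M^* \setminus M} \theta_j} - 1 \\
&\le \frac{1}{a} - 1,
\end{align*}

where the second inequality follows from the inequality $\mathrm{kl}(p, q) \le \frac{(p - q)^2}{q(1 - q)}$ for all $(p, q) \in [0, 1]^2$.

Moreover, we have that

\[ \beta(\theta) = \min_{M \neq M^*} \frac{\Delta_M}{|M \setminus M^*|} \ge \frac{\Delta_{\min}}{\max_M |M \setminus M^*|} = \frac{\Delta_{\min}}{k}. \]

Applying Theorem 2, we get:

\[ c(\theta) \ge \sum_{M \in \mathcal{H}} \frac{\beta(\theta)}{\max_{i \in M \setminus M^*} \mathrm{kl}\Big(\theta_i, \frac{1}{|M \setminus M^*|} \sum_{j \in M^* \setminus M} \theta_j\Big)} \ge \frac{\Delta_{\min} \, a \, |\mathcal{H}|}{k (1 - a)}, \]

which gives the required lower bound and completes the proof.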


□ 


A.4 Examples of Scaling of the Lower Bound 
A.4.1 Matchings 

In the first example, we assume that $\mathcal{M}$ is the set of perfect matchings in the complete bipartite
graph $\mathcal{K}_{m,m}$, with $|\mathcal{M}| = m!$ and $d = m^2$. A maximal subset $\mathcal{H}$ of $\mathcal{M}$ satisfying property $P(\theta)$ can
be constructed by adding all matchings that differ from the optimal matching by only two edges, see
Figure 1 for an illustration in the case of $m = 4$. Here $|\mathcal{H}| = \binom{m}{2}$ and thus, $|\mathcal{H}|$ scales as $m^2 = d$.
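
The construction above can be checked by brute force for small $m$. The sketch below is only an illustration: it assumes, without loss of generality, that the optimal matching is the identity matching, and the function names are hypothetical. It enumerates the matchings that differ from $M^*$ by exactly two edges and verifies that the sets $M \setminus M^*$ are pairwise disjoint, which is the disjointness requirement of property $P(\theta)$.

```python
from itertools import combinations, permutations

def two_edge_neighbours(m):
    """Matchings of K_{m,m} that differ from the identity matching by exactly two edges."""
    # Assumption for illustration: the optimal matching M* is the identity matching.
    optimal = frozenset((i, i) for i in range(m))
    family = []
    for perm in permutations(range(m)):
        matching = frozenset((i, perm[i]) for i in range(m))
        if len(matching - optimal) == 2:
            family.append(matching)
    return optimal, family

def pairwise_disjoint(optimal, family):
    """Check that the sets M \\ M* are pairwise disjoint over the family."""
    return all((a - optimal).isdisjoint(b - optimal)
               for a, b in combinations(family, 2))

optimal, family = two_edge_neighbours(4)
print(len(family), pairwise_disjoint(optimal, family))
```

For $m = 4$ this yields six matchings, in agreement with panels (b)-(g) of Figure 1.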

A.4.2 Spanning trees 

Consider the problem of finding the minimum spanning tree in a complete graph $\mathcal{K}_N$. This
corresponds to letting $\mathcal{M}$ be the set of all spanning trees in $\mathcal{K}_N$, where $|\mathcal{M}| = N^{N-2}$ (Cayley's formula).
In this case, we have $d = \binom{N}{2} = \frac{N(N-1)}{2}$, which is the number of edges of $\mathcal{K}_N$, and $m = N - 1$.

A maximal subset $\mathcal{H}$ of $\mathcal{M}$ satisfying property $P(\theta)$ can be constructed by composing all spanning
trees that differ from the optimal tree by one edge only, see Figure 2. In this case, $\mathcal{H}$ has
$d - m = \frac{(N-1)(N-2)}{2}$ elements.

Figure 2: Spanning trees in $\mathcal{K}_5$: (a) The optimal spanning tree $M^*$, (b)-(g) Elements of $\mathcal{H}$.

Figure 3: Routing in a grid: (a) Grid topology with source (red) and destination (blue) nodes, (b)
Optimal path $M^*$, (c)-(e) Elements of $\mathcal{H}$.

A.4.3 Routing in a grid 

Now we give an example in which $|\mathcal{H}|$ does not scale as $O(d)$. Consider routing in an $N$-by-$N$
directed grid, whose topology is shown in Figure 3(a), where the source (resp. destination) node is
shown in red (resp. blue). Here $\mathcal{M}$ is the set of all paths with $m = 2(N - 1)$ edges. We
further have $d = 2N(N - 1)$. In this example, elements of any maximal set $\mathcal{H}$ satisfying $P(\theta)$ do
not cover all basic actions. For instance, for the grid shown in Figure 3(a), the two edges incident to
the lower right corner do not appear in any arm in $\mathcal{H}$. It can be easily verified that in this case, $|\mathcal{H}|$
scales as $N$ rather than $N^2 = d$.


A.5 Lower Bound Example 

Here we provide an example, motivated by HQ, to investigate the tightness of the regret bounds of 
our algorithms. Consider the topology shown in Figure 4, where there are $d/m$ paths, each consisting
of $m$ links. Let parameter $\theta$ be defined such that

\[ \theta_i = \begin{cases} 0.5 & \text{if } i \text{ belongs to the first path,} \\ 0.5 - \delta & \text{otherwise.} \end{cases} \]

The first path is the optimal path and for any $M \neq M^*$ we have: $\Delta_M = \Delta = m\delta$. Since the various paths
are independent, this problem reduces to a classical MAB problem with $d/m$ arms. Observe that
the total reward of each path is the sum of $m$ independent Bernoulli random variables with the same
parameter. Hence, it is distributed according to a binomial distribution. It then follows that

\begin{align*}
\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)}
&\ge \sum_{M \neq M^*} \frac{\Delta}{\mathrm{KL}(\mathrm{Bin}(m, 0.5 - \delta), \mathrm{Bin}(m, 0.5))} \\
&= \frac{d - m}{m} \cdot \frac{\Delta}{m \, \mathrm{kl}(0.5 - \delta, 0.5)} \\
&\ge \frac{(d - m)\Delta}{4 m^2 \delta^2} = \frac{d - m}{4\Delta},
\end{align*}

Figure 4: Lower bound example

where the first equality follows from the fact that the KL divergence between two Binomial
distributions with respective parameters $(m, u)$ and $(m, v)$ is $m \, \mathrm{kl}(u, v)$, and where the last inequality is due to
the inequality $\mathrm{kl}(x, y) \le \frac{(x - y)^2}{y(1 - y)}$ for all $x, y \in (0, 1)$.
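
As a numerical sanity check of the two facts used in this example, the following sketch computes the KL divergence between two binomial distributions by direct summation, compares it with $m\,\mathrm{kl}(u, v)$, and evaluates both the exact constant and the relaxed constant $(d - m)/(4\Delta)$. The instance ($m = 4$, $d = 20$, $\delta = 0.1$) is hypothetical and chosen only for illustration.

```python
from math import comb, log

def kl_bernoulli(u, v):
    """Bernoulli KL divergence kl(u, v) for u, v in (0, 1)."""
    return u * log(u / v) + (1 - u) * log((1 - u) / (1 - v))

def kl_binomial(m, u, v):
    """KL(Bin(m, u) || Bin(m, v)) by direct summation over the support."""
    total = 0.0
    for k in range(m + 1):
        p = comb(m, k) * u**k * (1 - u)**(m - k)
        q = comb(m, k) * v**k * (1 - v)**(m - k)
        total += p * log(p / q)
    return total

m, d, delta = 4, 20, 0.1          # hypothetical instance with d/m = 5 paths
gap = m * delta                   # Delta = m * delta
exact = (d / m - 1) * gap / kl_binomial(m, 0.5 - delta, 0.5)
relaxed = (d - m) / (4 * gap)
print(exact, relaxed)
```

On this instance the exact constant is larger than the relaxed one, as the inequality on $\mathrm{kl}$ predicts.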


B Stochastic Combinatorial Bandits: Regret Analysis of ESCB 

We use the convention that for $v, u \in \mathbb{R}^d$, $vu = (v_i u_i)_{i \in [d]}$.

B.1 A concentration inequality

We first recall Lemma 1, a concentration inequality derived in [22, Theorem 2].

Lemma 1 There exists a number $C_m > 0$ depending only on $m$ such that, for all $M$ and all $n \ge 2$:
\[ \mathbb{P}\big[(M t(n))^\top \mathrm{kl}(\hat\theta(n), \theta) \ge f(n)\big] \le C_m n^{-1} (\log(n))^{-2}. \]


B.2 Proof of Theorem 3 


First statement: 

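Consider $q \in \Theta$, and apply the Cauchy-Schwarz inequality:

\[ M^\top (q - \hat\theta(n)) = \sum_{i=1}^{d} \sqrt{t_i(n)} \, (q_i - \hat\theta_i(n)) \frac{M_i}{\sqrt{t_i(n)}} \le \sqrt{\sum_{i=1}^{d} M_i t_i(n) (q_i - \hat\theta_i(n))^2} \ \sqrt{\sum_{i=1}^{d} \frac{M_i}{t_i(n)}}. \]

By Pinsker's inequality, for all $(p, q) \in [0, 1]^2$ we have $2(p - q)^2 \le \mathrm{kl}(p, q)$ so that:

\[ M^\top (q - \hat\theta(n)) \le \sqrt{\frac{(M t(n))^\top \mathrm{kl}(\hat\theta(n), q)}{2}} \ \sqrt{\sum_{i=1}^{d} \frac{M_i}{t_i(n)}}. \]

Hence, $(M t(n))^\top \mathrm{kl}(\hat\theta(n), q) \le f(n)$ implies:

\[ M^\top q = M^\top \hat\theta(n) + M^\top (q - \hat\theta(n)) \le M^\top \hat\theta(n) + \sqrt{\frac{f(n)}{2} \sum_{i=1}^{d} \frac{M_i}{t_i(n)}} = c_M(n), \]

so that, by definition of $b_M(n)$, we have $b_M(n) \le c_M(n)$.

Second statement:

If $(M t(n))^\top \mathrm{kl}(\hat\theta(n), \theta) \le f(n)$ then, by definition of $b_M(n)$, we have $b_M(n) \ge M^\top \theta$. Therefore,
using Lemma 1, there exists $C_m$ such that for all $n \ge 2$ we have:

\[ \mathbb{P}[b_M(n) < M^\top \theta] \le \mathbb{P}\big[(M t(n))^\top \mathrm{kl}(\hat\theta(n), \theta) > f(n)\big] \le C_m n^{-1} (\log(n))^{-2}, \]

which concludes the proof.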


B.3 Proof of Theorem 4 

We recall the following facts about the KL divergence kl, for all $p \in [0, 1]$:

(i) $q \mapsto \mathrm{kl}(p, q)$ is strictly convex on $[0, 1]$ and attains its minimum at $p$, with $\mathrm{kl}(p, p) = 0$.

(ii) Its derivative with respect to the second parameter, $q \mapsto \mathrm{kl}'(p, q) = \frac{q - p}{q(1 - q)}$, is strictly
increasing on $(p, 1)$.

(iii) For $p < 1$, we have $\mathrm{kl}(p, q) \to \infty$ and $\mathrm{kl}'(p, q) \to \infty$ as $q \to 1^-$.

Consider $M$ and $n$ fixed throughout the proof. Define $I = \{i \in M : \hat\theta_i(n) \neq 1\}$. Consider $q^* \in \Theta$,
the optimal solution of the optimization problem:

\begin{align*}
\max_{q \in \Theta} \quad & M^\top q \\
\text{s.t.} \quad & (M t(n))^\top \mathrm{kl}(\hat\theta(n), q) \le f(n),
\end{align*}

so that $b_M(n) = M^\top q^*$. Consider $i \notin M$; then $M^\top q$ does not depend on $q_i$, and from (i) we get
$q_i^* = \hat\theta_i(n)$. Now consider $i \in M$. From (i) we get that $1 \ge q_i^* \ge \hat\theta_i(n)$. Hence $q_i^* = 1$ if $\hat\theta_i(n) = 1$.
If $I$ is empty, then $q_i^* = 1$ for all $i \in M$, so that $b_M(n) = \|M\|_1$.

Consider the case where $I \neq \emptyset$. From (iii) and the fact that $t(n)^\top \mathrm{kl}(\hat\theta(n), q^*) < \infty$ we get $\hat\theta_i(n) \le q_i^* < 1$.
From the Karush-Kuhn-Tucker (KKT) conditions, there exists $\lambda^* > 0$ such that for all
$i \in I$:

\[ 1 = \lambda^* t_i(n) \, \mathrm{kl}'(\hat\theta_i(n), q_i^*). \]

For $\lambda > 0$ define $\hat\theta_i(n) \le q_i(\lambda) < 1$ as a solution to the equation:

\[ 1 = \lambda t_i(n) \, \mathrm{kl}'(\hat\theta_i(n), q_i(\lambda)). \]

From (i) we have that $\lambda \mapsto q_i(\lambda)$ is uniquely defined, is strictly decreasing and $\hat\theta_i(n) < q_i(\lambda) < 1$.
From (iii) we get that $q_i(\mathbb{R}^+) = [\hat\theta_i(n), 1]$. Define the function:

\[ F(\lambda) = \sum_{i \in I} t_i(n) \, \mathrm{kl}(\hat\theta_i(n), q_i(\lambda)). \]

From the reasoning above, $F$ is well defined, strictly decreasing (since each $q_i(\lambda)$ decreases towards $\hat\theta_i(n)$ as $\lambda$ grows) and $F(\mathbb{R}^+) = \mathbb{R}^+$. Therefore, $\lambda^*$ is
the unique solution to $F(\lambda^*) = f(n)$, and $q_i^* = q_i(\lambda^*)$. Furthermore, replacing $\mathrm{kl}'$ by its expression
we obtain the quadratic equation:

\[ q_i(\lambda)^2 + q_i(\lambda)(\lambda t_i(n) - 1) - \lambda t_i(n) \hat\theta_i(n) = 0. \]

Solving for $q_i(\lambda)$, we obtain that $q_i(\lambda) = g(\lambda, \hat\theta_i(n), t_i(n))$, which concludes the proof. □
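
The proof suggests a simple numerical procedure: take the positive root of the quadratic equation for each basic action and locate $\lambda^*$ by a one-dimensional search on $F(\lambda) = f(n)$. The sketch below is a minimal illustration and not the authors' implementation; it uses bisection (an assumed choice of root finder), clips the arguments of kl away from 0 and 1 for numerical safety, and the function names and the `budget` argument (standing for $f(n)$) are hypothetical. It also evaluates $c_M(n)$ so that the inequality $b_M(n) \le c_M(n)$ of Theorem 3 can be observed numerically.

```python
from math import log, sqrt

def kl(p, q, eps=1e-12):
    """Bernoulli KL divergence kl(p, q), with arguments clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

def g(lam, theta, t):
    """Positive root of q^2 + q (lam t - 1) - lam t theta = 0."""
    a = 1.0 - lam * t
    return 0.5 * (a + sqrt(a * a + 4.0 * lam * t * theta))

def index_b(theta_hat, counts, budget, iters=200):
    """b_M(n) restricted to the basic actions of one arm M; budget stands for f(n)."""
    active = [(th, t) for th, t in zip(theta_hat, counts) if th < 1.0]
    saturated = len(theta_hat) - len(active)   # coordinates with theta_hat = 1 give q* = 1
    if not active:
        return float(saturated)

    def F(lam):
        return sum(t * kl(th, g(lam, th, t)) for th, t in active)

    lo, hi = 0.0, 1.0
    while F(hi) > budget:          # F decreases in lam: grow hi until the constraint holds
        hi *= 2.0
    for _ in range(iters):         # bisection on F(lam) = budget, keeping hi feasible
        mid = 0.5 * (lo + hi)
        if F(mid) > budget:
            lo = mid
        else:
            hi = mid
    return saturated + sum(g(hi, th, t) for th, t in active)

def index_c(theta_hat, counts, budget):
    """c_M(n) = M^T theta_hat + sqrt(f(n)/2 * sum_i M_i / t_i(n))."""
    return sum(theta_hat) + sqrt(0.5 * budget * sum(1.0 / t for t in counts))

theta_hat, counts = [0.5, 0.4], [10, 20]
print(index_b(theta_hat, counts, log(100)), index_c(theta_hat, counts, log(100)))
```

Bisection is used here only because $F$ is monotone; any scalar root finder would serve the same purpose.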
B.4 Proof of Theorem 5


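To prove Theorem 5, we borrow some ideas from the proof of [11, Theorem 3].

For any $n \in \mathbb{N}$, $s \in \mathbb{R}^d$, and $M \in \mathcal{M}$ define $h_{n,s,M} = \sqrt{\frac{f(n)}{2} \sum_{i=1}^{d} \frac{M_i}{s_i}}$, and introduce the
following events:

\begin{align*}
G_n &= \{(M^* t(n))^\top \mathrm{kl}(\hat\theta(n), \theta) > f(n)\}, \\
H_{i,n} &= \{M_i(n) = 1, \ |\hat\theta_i(n) - \theta_i| \ge m^{-1} \Delta_{\min}/2\}, \quad H_n = \bigcup_{i=1}^{d} H_{i,n}, \\
F_n &= \{\Delta_{M(n)} \le 2 h_{T,t(n),M(n)}\}.
\end{align*}

Then the regret can be bounded as:

\begin{align*}
R^\pi(T) = \mathbb{E}\Big[\sum_{n=1}^{T} \Delta_{M(n)}\Big]
&\le \mathbb{E}\Big[\sum_{n=1}^{T} \Delta_{M(n)} (1\{G_n\} + 1\{H_n\})\Big] + \mathbb{E}\Big[\sum_{n=1}^{T} \Delta_{M(n)} 1\{\overline{G_n}, \overline{H_n}\}\Big] \\
&\le m \, \mathbb{E}\Big[\sum_{n=1}^{T} (1\{G_n\} + 1\{H_n\})\Big] + \mathbb{E}\Big[\sum_{n=1}^{T} \Delta_{M(n)} 1\{\overline{G_n}, \overline{H_n}\}\Big],
\end{align*}

since $\Delta_{M(n)} \le m$.

Next we show that for any $n$ such that $M(n) \neq M^*$, it holds that $\overline{G_n} \cap \overline{H_n} \subset F_n$. Recall
that $c_M(n) \ge b_M(n)$ for any $M$ and $n$ (Theorem 3). Moreover, if $\overline{G_n}$ holds, we have
$(M^* t(n))^\top \mathrm{kl}(\hat\theta(n), \theta) \le f(n)$, which by definition of $b_M$ implies: $b_{M^*}(n) \ge M^{*\top} \theta$. Hence
we have:

\begin{align*}
1\{\overline{G_n}, \overline{H_n}, M(n) \neq M^*\}
&= 1\{\overline{G_n}, \overline{H_n}, \xi_{M(n)}(n) \ge \xi_{M^*}(n)\} \\
&\le 1\{\overline{H_n}, c_{M(n)}(n) \ge M^{*\top} \theta\} \\
&= 1\{\overline{H_n}, M(n)^\top \hat\theta(n) + h_{n,t(n),M(n)} \ge M^{*\top} \theta\} \\
&\le 1\{M(n)^\top \theta + \Delta_{M(n)}/2 + h_{n,t(n),M(n)} \ge M^{*\top} \theta\} \\
&= 1\{2 h_{n,t(n),M(n)} \ge \Delta_{M(n)}\} \\
&\le 1\{2 h_{T,t(n),M(n)} \ge \Delta_{M(n)}\} \\
&= 1\{F_n\},
\end{align*}

where the second inequality follows from the fact that event $\overline{H_n}$ implies: $M(n)^\top \hat\theta(n) \le M(n)^\top \theta + \Delta_{\min}/2 \le M(n)^\top \theta + \Delta_{M(n)}/2$.

Hence, the regret is upper bounded by:

\[ R^\pi(T) \le m \, \mathbb{E}\Big[\sum_{n=1}^{T} 1\{G_n\}\Big] + m \, \mathbb{E}\Big[\sum_{n=1}^{T} 1\{H_n\}\Big] + \mathbb{E}\Big[\sum_{n=1}^{T} \Delta_{M(n)} 1\{F_n\}\Big]. \]

We will prove the following inequalities: (i) $\mathbb{E}[\sum_{n=1}^{T} 1\{G_n\}] \le m^{-1} C'_m$, with $C'_m \ge 0$ independent
of $\theta$, $d$, and $T$, (ii) $\mathbb{E}[\sum_{n=1}^{T} 1\{H_n\}] \le 4 d m^2 \Delta_{\min}^{-2}$, and (iii) $\mathbb{E}[\sum_{n=1}^{T} \Delta_{M(n)} 1\{F_n\}] \le 16 d \sqrt{m} \Delta_{\min}^{-1} f(T)$.

Hence as announced:

\[ R^\pi(T) \le 16 d \sqrt{m} \Delta_{\min}^{-1} f(T) + 4 d m^3 \Delta_{\min}^{-2} + C'_m. \]

Inequality (i): An application of Lemma 1 gives

\begin{align*}
\mathbb{E}\Big[\sum_{n=1}^{T} 1\{G_n\}\Big] &= \sum_{n=1}^{T} \mathbb{P}\big[(M^* t(n))^\top \mathrm{kl}(\hat\theta(n), \theta) > f(n)\big] \\
&\le 1 + \sum_{n \ge 2} C_m n^{-1} (\log(n))^{-2} \equiv m^{-1} C'_m < \infty.
\end{align*}
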
Inequality (ii): Fix $i$ and $n$. Define $s = \sum_{n'=1}^{n} 1\{H_{i,n'}\}$. Observe that $H_{i,n'}$ implies $M_i(n') = 1$,
hence $t_i(n) \ge s$. Therefore, applying [27, Lemma B.1], we have that $\sum_{n=1}^{T} \mathbb{P}[H_{i,n}] \le 4 m^2 \Delta_{\min}^{-2}$.
Using the union bound: $\sum_{n=1}^{T} \mathbb{P}[H_n] \le 4 d m^2 \Delta_{\min}^{-2}$.


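Inequality (iii): Let $\ell > 0$. For any $n$ introduce the following events:

\begin{align*}
S_n &= \{i \in M(n) : t_i(n) \le 4 m f(T) \Delta_{M(n)}^{-2}\}, \\
A_n &= \{|S_n| \ge \ell\}, \\
B_n &= \{|S_n| < \ell, \ [\exists i \in M(n) : t_i(n) \le 4 \ell f(T) \Delta_{M(n)}^{-2}]\}.
\end{align*}

We claim that for any $n$ such that $M(n) \neq M^*$, we have $F_n \subset (A_n \cup B_n)$. To prove this, we
show that when $F_n$ holds and $M(n) \neq M^*$, the event $\overline{A_n \cup B_n}$ cannot happen. Let $n$ be a time
instant such that $M(n) \neq M^*$ and $F_n$ holds, and assume that $\overline{A_n \cup B_n} = \{|S_n| < \ell, \ [\forall i \in M(n) : t_i(n) > 4 \ell f(T) \Delta_{M(n)}^{-2}]\}$ happens. Then $F_n$ implies:

\begin{align}
\Delta_{M(n)} \le 2 h_{T,t(n),M(n)}
&= 2 \sqrt{\frac{f(T)}{2} \Big( \sum_{i \in [d] \setminus S_n} \frac{M_i(n)}{t_i(n)} + \sum_{i \in S_n} \frac{M_i(n)}{t_i(n)} \Big)} \notag \\
&\le 2 \sqrt{\frac{f(T)}{2} \Big( \frac{m \Delta_{M(n)}^2}{4 m f(T)} + \frac{|S_n| \Delta_{M(n)}^2}{4 \ell f(T)} \Big)} < \Delta_{M(n)}, \tag{11}
\end{align}

where the last inequality uses the observation that $\overline{A_n \cup B_n}$ implies $|S_n| < \ell$. Clearly, (11) is a
contradiction. Thus $F_n \subset (A_n \cup B_n)$ and consequently:

\[ \sum_{n=1}^{T} \Delta_{M(n)} 1\{F_n\} \le \sum_{n=1}^{T} \Delta_{M(n)} 1\{A_n\} + \sum_{n=1}^{T} \Delta_{M(n)} 1\{B_n\}. \tag{12} \]

To further bound the r.h.s. of the above, we introduce the following events for any $i$:

\begin{align*}
A_{i,n} &= A_n \cap \{i \in M(n), \ t_i(n) \le 4 m f(T) \Delta_{M(n)}^{-2}\}, \\
B_{i,n} &= B_n \cap \{i \in M(n), \ t_i(n) \le 4 \ell f(T) \Delta_{M(n)}^{-2}\}.
\end{align*}

It is noted that:

\[ \sum_{i \in [d]} 1\{A_{i,n}\} = 1\{A_n\} \sum_{i \in [d]} 1\{i \in S_n\} = |S_n| 1\{A_n\} \ge \ell \, 1\{A_n\}, \]

and hence: $1\{A_n\} \le \frac{1}{\ell} \sum_{i \in [d]} 1\{A_{i,n}\}$. Moreover $1\{B_n\} \le \sum_{i \in [d]} 1\{B_{i,n}\}$. Let each basic
action $i$ belong to $K_i$ suboptimal arms, ordered based on their gaps as: $\Delta^{i,1} \ge \dots \ge \Delta^{i,K_i} > 0$.
Also define $\Delta^{i,0} = \infty$. Plugging the above inequalities into (12), we have

\begin{align*}
\sum_{n=1}^{T} \Delta_{M(n)} 1\{F_n\}
&\le \sum_{n=1}^{T} \sum_{i=1}^{d} \frac{\Delta_{M(n)}}{\ell} 1\{A_{i,n}\} + \sum_{n=1}^{T} \sum_{i=1}^{d} \Delta_{M(n)} 1\{B_{i,n}\} \\
&= \sum_{n=1}^{T} \sum_{i=1}^{d} \frac{\Delta_{M(n)}}{\ell} 1\{A_{i,n}, M(n) \neq M^*\} + \sum_{n=1}^{T} \sum_{i=1}^{d} \Delta_{M(n)} 1\{B_{i,n}, M(n) \neq M^*\} \\
&\le \sum_{n=1}^{T} \sum_{i=1}^{d} \sum_{k \in [K_i]} \frac{\Delta^{i,k}}{\ell} 1\{A_{i,n}, M(n) = k\} + \sum_{n=1}^{T} \sum_{i=1}^{d} \sum_{k \in [K_i]} \Delta^{i,k} 1\{B_{i,n}, M(n) = k\} \\
&\le \sum_{i=1}^{d} \sum_{n=1}^{T} \sum_{k \in [K_i]} \frac{\Delta^{i,k}}{\ell} 1\{i \in M(n), \ t_i(n) \le 4 m f(T) (\Delta^{i,k})^{-2}, \ M(n) = k\} \\
&\quad + \sum_{i=1}^{d} \sum_{n=1}^{T} \sum_{k \in [K_i]} \Delta^{i,k} 1\{i \in M(n), \ t_i(n) \le 4 \ell f(T) (\Delta^{i,k})^{-2}, \ M(n) = k\} \\
&\le \frac{8 d f(T)}{\Delta_{\min}} \Big( \frac{m}{\ell} + \ell \Big),
\end{align*}

where the last inequality follows from Lemma 2, which is proven next. The proof is completed by
setting $\ell = \sqrt{m}$. □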
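
To see why this choice of the free parameter is natural, note that the final bound of inequality (iii) is proportional to $m/\ell + \ell$, where $\ell$ denotes the threshold on $|S_n|$. The short derivation below is a sketch of the balancing argument rather than part of the original proof.

```latex
\frac{m}{\ell} + \ell \;\ge\; 2\sqrt{\frac{m}{\ell}\cdot \ell} \;=\; 2\sqrt{m}
\quad \text{(AM-GM), with equality iff } \ell = \sqrt{m},
\qquad\text{so}\qquad
\frac{8 d f(T)}{\Delta_{\min}}\Big(\frac{m}{\ell} + \ell\Big)\Big|_{\ell=\sqrt{m}}
= \frac{16\, d \sqrt{m}\, f(T)}{\Delta_{\min}} .
```

This matches the leading term $16 d \sqrt{m} \Delta_{\min}^{-1} f(T)$ announced for inequality (iii).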


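Lemma 2 Let $C > 0$ be a constant independent of $n$. Then for any $i$ such that $K_i \ge 1$:

\[ \sum_{n=1}^{T} \sum_{k=1}^{K_i} 1\{i \in M(n), \ t_i(n) \le C (\Delta^{i,k})^{-2}, \ M(n) = k\} \Delta^{i,k} \le \frac{2C}{\Delta_{\min}}. \]

Proof. We have:

\begin{align*}
& \sum_{n=1}^{T} \sum_{k=1}^{K_i} 1\{i \in M(n), \ t_i(n) \le C (\Delta^{i,k})^{-2}, \ M(n) = k\} \Delta^{i,k} \\
&= \sum_{n=1}^{T} \sum_{k=1}^{K_i} \sum_{j=1}^{k} 1\{i \in M(n), \ t_i(n) \in (C (\Delta^{i,j-1})^{-2}, C (\Delta^{i,j})^{-2}], \ M(n) = k\} \Delta^{i,k} \\
&\le \sum_{n=1}^{T} \sum_{k=1}^{K_i} \sum_{j=1}^{k} 1\{i \in M(n), \ t_i(n) \in (C (\Delta^{i,j-1})^{-2}, C (\Delta^{i,j})^{-2}], \ M(n) = k\} \Delta^{i,j} \\
&\le \sum_{n=1}^{T} \sum_{k=1}^{K_i} \sum_{j=1}^{K_i} 1\{i \in M(n), \ t_i(n) \in (C (\Delta^{i,j-1})^{-2}, C (\Delta^{i,j})^{-2}], \ M(n) = k\} \Delta^{i,j} \\
&\le \sum_{n=1}^{T} \sum_{j=1}^{K_i} 1\{i \in M(n), \ t_i(n) \in (C (\Delta^{i,j-1})^{-2}, C (\Delta^{i,j})^{-2}], \ M(n) \neq M^*\} \Delta^{i,j} \\
&\le \frac{C}{\Delta^{i,1}} + \sum_{j=2}^{K_i} C \big( (\Delta^{i,j})^{-2} - (\Delta^{i,j-1})^{-2} \big) \Delta^{i,j} \\
&\le \frac{C}{\Delta^{i,1}} + \int_{\Delta^{i,K_i}}^{\Delta^{i,1}} 2 C x^{-2} \, dx \le \frac{2C}{\Delta^{i,K_i}} \le \frac{2C}{\Delta_{\min}},
\end{align*}

which completes the proof.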


□ 
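
The last two steps of the proof of Lemma 2 bound a telescoping-type sum over the ordered gaps. As a quick numerical illustration (with $C = 1$ and a hypothetical list of gaps), the sketch below evaluates $\sum_{j} ((\Delta^{i,j})^{-2} - (\Delta^{i,j-1})^{-2}) \Delta^{i,j}$ with the convention $\Delta^{i,0} = \infty$ and compares it with $2/\Delta^{i,K_i}$.

```python
def lemma2_sum(gaps):
    """Sum over j of (gaps[j]^-2 - gaps[j-1]^-2) * gaps[j], with gap_0 = infinity.

    gaps must be positive and sorted in decreasing order.
    """
    total, previous = 0.0, 0.0     # previous = (gap_{j-1})^-2, equal to 0 for gap_0 = inf
    for gap in gaps:
        total += (gap ** -2 - previous) * gap
        previous = gap ** -2
    return total

gaps = [1.0, 0.5, 0.25, 0.1]       # hypothetical ordered gaps
print(lemma2_sum(gaps), 2.0 / min(gaps))
```

The factor 2 in the bound cannot be dropped in general: for these gaps the sum is already larger than $1/\Delta^{i,K_i}$.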


B.5 Epoch-ESCB: An algorithm with lower computational complexity 

ESCB with time horizon $T$ has a complexity of $O(|\mathcal{M}| T)$ as neither $b_M$ nor $c_M$ can be written as
$M^\top y$ for some vector $y \in \mathbb{R}^d$. Since $\mathcal{M}$ typically has exponentially many elements, we deduce that
ESCB is not computationally efficient. Assuming that the offline (static) combinatorial problem is
solvable in $O(V(\mathcal{M}))$ time, the complexity of the CUCB algorithm in [10] and [11] after $T$ rounds is
$O(V(\mathcal{M}) T)$. Thus, if the offline problem is efficiently implementable, i.e., $V(\mathcal{M}) = O(\mathrm{poly}(d))$,
CUCB is efficient, whereas ESCB is not. We next propose an extension to ESCB, called Epoch-ESCB,
that attains almost the same regret as ESCB while enjoying much better computational
complexity.

The Epoch-ESCB algorithm operates in epochs of varying lengths. Epoch $k$ comprises rounds $\{N_k, \dots, N_{k+1} - 1\}$,
where $N_{k+1}$ (and thus the length of the $k$-th epoch) is determined at time $n = N_k$. The algorithm
simply consists in playing the arm with the maximal index at the beginning of every epoch,
and playing the current leader (i.e., the arm with the highest empirical average reward) in the rest of
the rounds. If the leader is the arm with the maximal index, the length of epoch $k$ will be set twice as
long as the previous epoch $k - 1$, i.e., $N_{k+1} = N_k + 2(N_k - N_{k-1})$. Otherwise, it will be set to 1.
In contrast to ESCB, Epoch-ESCB computes the maximal index infrequently, and more precisely
(almost) at an exponentially decreasing rate. Thus, one might expect that after $T$ rounds, the maximal
index will be computed $O(\log(T))$ times. The pseudo-code of Epoch-ESCB is presented in
Algorithm 3.

We assess the performance of Epoch-ESCB through numerical experiments in the next subsection,
and leave the analysis of its regret as future work. These experiments corroborate our conjecture
that the complexity of Epoch-ESCB after $T$ rounds will be $O(V(\mathcal{M}) T + \log(T) |\mathcal{M}|)$. Compared
to CUCB, the complexity is penalized by $|\mathcal{M}| \log(T)$, which may become dominated by the term
$V(\mathcal{M}) T$ as $T$ grows large.
Algorithm 3 Epoch-ESCB
Initialization: Set $k = 1$ and $N_0 = N_1 = 1$.
for $n \ge 1$ do
    Compute $L(n) \in \arg\max_{M \in \mathcal{M}} M^\top \hat\theta(n)$.
    if $n = N_k$ then
        Select arm $M(n) \in \arg\max_{M \in \mathcal{M}} c_M(n)$.
        if $M(n) = L(n)$ then
            Set $N_{k+1} = N_k + 2(N_k - N_{k-1})$.
        else
            Set $N_{k+1} = N_k + 1$.
        end if
        Increment $k$.
    else
        Select arm $M(n) = L(n)$.
    end if
    Observe the rewards, and update $t_i(n)$ and $\hat\theta_i(n)$, $\forall i \in M(n)$.
end for
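
To make the epoch schedule concrete, here is a minimal simulation sketch of Algorithm 3 on an explicitly enumerated set of arms with Bernoulli rewards. It is an illustration under several assumptions that are not part of the algorithm's specification: the index is taken to be $c_M(n)$ with $f(n)$ replaced by $\log(n)$, every basic action receives one initial sample so that $t_i(n) \ge 1$, and the epoch length is forced to be at least one round (with $N_0 = N_1 = 1$ the doubling rule would otherwise give an empty first epoch). The function name `epoch_escb` and its arguments are hypothetical.

```python
import random
from math import log, sqrt

def epoch_escb(arms, theta, horizon, seed=0):
    """Simulate Epoch-ESCB; returns the played arms and the number of index computations."""
    rng = random.Random(seed)
    d = len(theta)
    counts = [1] * d
    # Assumption: one initial sample per basic action, so that t_i(n) >= 1.
    means = [float(rng.random() < theta[i]) for i in range(d)]

    def empirical(arm):
        return sum(means[i] for i in arm)

    def index_c(arm, f_n):
        return empirical(arm) + sqrt(0.5 * f_n * sum(1.0 / counts[i] for i in arm))

    n_prev, n_next = 1, 1                # N_{k-1} and N_k
    index_computations = 0
    choices = []
    for n in range(1, horizon + 1):
        leader = max(arms, key=empirical)
        if n == n_next:                  # first round of an epoch: compute the indexes
            chosen = max(arms, key=lambda arm: index_c(arm, log(n)))
            index_computations += 1
            if chosen == leader:
                # Doubling rule; max(1, .) is an assumption keeping epochs non-empty.
                length = max(1, 2 * (n_next - n_prev))
            else:
                length = 1
            n_prev, n_next = n_next, n_next + length
        else:                            # inside an epoch: play the leader
            chosen = leader
        for i in chosen:                 # semi-bandit feedback on the played basic actions
            x = float(rng.random() < theta[i])
            counts[i] += 1
            means[i] += (x - means[i]) / counts[i]
        choices.append(chosen)
    return choices, index_computations

arms = [(0, 1), (2, 3), (0, 3)]          # hypothetical arms given as tuples of basic actions
choices, computations = epoch_escb(arms, [0.9, 0.8, 0.2, 0.3], horizon=500)
print(computations)
```

Counting `index_computations` against the horizon gives a direct view of how rarely the expensive maximization over all arms is performed.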




Figure 5: Regret of various algorithms for matchings with $a = 0.7$ and $b = 0.5$ (panels (a) and (b)).


B.6 Numerical Experiments 

In this section, we compare the performance of ESCB against existing algorithms through numerical
experiments for some classes of $\mathcal{M}$. When implementing ESCB we replace $f(n)$ by $\log(n)$,
ignoring the term proportional to $\log(\log(n))$, as is done when implementing KL-UCB in practice.

B.6.1 Experiment 1: Matching 

In our first experiment, we consider the matching problem with $N_1 = N_2 = 5$, which corresponds
to $d = 5^2 = 25$ and $m = 5$. We also set $\theta$ such that $\theta_i = a$ if $i \in M^*$, and $\theta_i = b$ otherwise, with
$0 < b < a < 1$. In this case the lower bound becomes $c(\theta) = \frac{m(m-1)(a-b)}{2 \, \mathrm{kl}(b, a)}$.
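
For reference, the value of this lower bound can be evaluated directly. The sketch below assumes the expression $c(\theta) = \binom{m}{2}(a - b)/\mathrm{kl}(b, a)$, i.e. the bound of Theorem 2 with $\mathcal{H}$ made of the $\binom{m}{2}$ matchings that differ from $M^*$ by two edges and $\beta(\theta) = a - b$; the function name is hypothetical.

```python
from math import log

def kl(p, q):
    """Bernoulli KL divergence kl(p, q) for p, q in (0, 1)."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

def matching_lower_bound(m, a, b):
    """Constant in front of log(T): binom(m, 2) * (a - b) / kl(b, a)."""
    size_h = m * (m - 1) // 2
    return size_h * (a - b) / kl(b, a)

print(matching_lower_bound(5, 0.7, 0.5))    # setting of the first case (a = 0.7, b = 0.5)
print(matching_lower_bound(5, 0.95, 0.3))   # setting of the second case (a = 0.95, b = 0.3)
```

The constant is markedly smaller in the second setting, where the gap between optimal and suboptimal basic actions is large.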

Figure 5(a)-(b) depicts the regret of various algorithms for the case of $a = 0.7$ and $b = 0.5$. The
curves in Figure 5(a) are shown with a 95% confidence interval. We observe that ESCB-1 has
the lowest regret. Moreover, ESCB-2 significantly outperforms CUCB and LLR, and is close to
ESCB-1. We also observe that the regret of Epoch-ESCB is quite close to that of
ESCB-2.

Figure 6(a)-(b) presents the regret of various algorithms for the case of $a = 0.95$ and $b = 0.3$.
The difference compared to the former case is that ESCB-1 significantly outperforms ESCB-2.
The reason is that in the former case, the mean rewards of most of the basic actions were close to
1/2, for which the performance of UCB-type algorithms is closer to that of their KL-divergence based
counterparts. On the other hand, when mean rewards are not close to 1/2, there exists a significant
performance gap between ESCB-1 and ESCB-2. Comparing the results with the 'lower bound'
curve, we highlight that ESCB-1 gives close-to-optimal performance in both cases. Furthermore,
similar to the previous experiment, Epoch-ESCB attains a regret whose curve is almost indistinguishable
from that of ESCB-2.
Figure 6: Regret of various algorithms for matchings with $a = 0.95$ and $b = 0.3$.

Figure 7: Number of epochs in Epoch-ESCB vs. time for Experiments 1 and 2 (95% confidence
interval).

The number of epochs in Epoch-ESCB vs. time for the two examples is displayed in Figure 7(a)-(b),
where the curves are shown with 95% confidence intervals. We observe that in both cases, the
number of epochs grows at a rate proportional to $\log(n)/n$ at round $n$. Since the number of epochs
is equal to the number of times the algorithm computes indexes, these curves suggest that index
computation after $n$ rounds requires a number of operations that scales as $|\mathcal{M}| \log(n)$.


B.6.2 Experiment 2: Spanning Trees 


In the second experiment, we consider the spanning trees problem described in Section A.4.2 for the
case of $N = 5$. In this case, we have $d = \binom{5}{2} = 10$, $m = 4$, and $|\mathcal{M}| = 5^3 = 125$.

Figure 8 portrays the regret of various algorithms with 95% confidence intervals, with $\Delta_{\min} = 0.54$.
Our algorithms significantly outperform CUCB and LLR.

Figure 8: Regret of various algorithms for spanning trees with $N = 5$ and $\Delta_{\min} = 0.54$.
C Proofs for Adversarial Combinatorial Bandits


C.1 Proof of Theorem 6


We first prove a simple result: 

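Lemma 3 For all $x \in \mathbb{R}^d$, we have $\Sigma_{n-1}^+ \Sigma_{n-1} x = \bar{x}$, where $\bar{x}$ is the orthogonal projection of $x$
onto $\mathrm{span}(\mathcal{M})$, the linear space spanned by $\mathcal{M}$.

Proof: Note that for all $y \in \mathbb{R}^d$, if $\Sigma_{n-1} y = 0$, then we have

\[ y^\top \Sigma_{n-1} y = \mathbb{E}[y^\top M M^\top y] = \mathbb{E}[(y^\top M)^2] = 0, \tag{13} \]

where $M$ has law $p_{n-1}$ such that $\sum_{M} M_i \, p_{n-1}(M) = m q'_{n-1}(i)$, $\forall i \in [d]$, and $q'_{n-1} = (1 - \gamma) q_{n-1} + \gamma \mu^0$.
By definition of $\mu^0$, each $M \in \mathcal{M}$ has a positive probability. Hence, by (13), $y^\top M = 0$ for all
$M \in \mathcal{M}$. In particular, we see that the linear application $\Sigma_{n-1}$ restricted to $\mathrm{span}(\mathcal{M})$ is invertible
and is zero on $\mathrm{span}(\mathcal{M})^\perp$, hence we have $\Sigma_{n-1}^+ \Sigma_{n-1} x = \bar{x}$. □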

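Lemma 4 We have for any $\eta \le \frac{\gamma \lambda}{m^{3/2}}$ and any $q \in \mathcal{P}$,

\[ \sum_{n=1}^{T} q^\top \tilde{X}(n) - \sum_{n=1}^{T} q_{n-1}^\top \tilde{X}(n) \le \eta \sum_{n=1}^{T} q_{n-1}^\top \tilde{X}^2(n) + \frac{\mathrm{KL}(q, q_0)}{\eta}, \]

where $\tilde{X}^2(n)$ is the vector that is the coordinate-wise square of $\tilde{X}(n)$.

Proof: We have

\[ \mathrm{KL}(q, \tilde{q}_n) - \mathrm{KL}(q, q_{n-1}) = \sum_{i \in [d]} q(i) \log \frac{q_{n-1}(i)}{\tilde{q}_n(i)} = -\eta \sum_{i \in [d]} q(i) \tilde{X}_i(n) + \log Z_n, \]

with

\begin{align}
\log Z_n &= \log \sum_{i \in [d]} q_{n-1}(i) \exp\big(\eta \tilde{X}_i(n)\big) \notag \\
&\le \log \sum_{i \in [d]} q_{n-1}(i) \big(1 + \eta \tilde{X}_i(n) + \eta^2 \tilde{X}_i^2(n)\big) \tag{14} \\
&\le \eta q_{n-1}^\top \tilde{X}(n) + \eta^2 q_{n-1}^\top \tilde{X}^2(n), \tag{15}
\end{align}

where we used $\exp(z) \le 1 + z + z^2$ for all $|z| \le 1$ in (14) and $\log(1 + z) \le z$ for all $z > -1$ in (15).
Later we verify the condition for the former inequality.

Hence we have

\[ \mathrm{KL}(q, \tilde{q}_n) - \mathrm{KL}(q, q_{n-1}) \le \eta q_{n-1}^\top \tilde{X}(n) - \eta q^\top \tilde{X}(n) + \eta^2 q_{n-1}^\top \tilde{X}^2(n). \]

The generalized Pythagorean inequality (see Theorem 3.1 in [23]) gives

\[ \mathrm{KL}(q, q_n) + \mathrm{KL}(q_n, \tilde{q}_n) \le \mathrm{KL}(q, \tilde{q}_n). \]

Since $\mathrm{KL}(q_n, \tilde{q}_n) \ge 0$, we get

\[ \mathrm{KL}(q, q_n) - \mathrm{KL}(q, q_{n-1}) \le \eta q_{n-1}^\top \tilde{X}(n) - \eta q^\top \tilde{X}(n) + \eta^2 q_{n-1}^\top \tilde{X}^2(n). \]

Finally, summing over $n$ gives

\[ \sum_{n=1}^{T} \big( q^\top \tilde{X}(n) - q_{n-1}^\top \tilde{X}(n) \big) \le \eta \sum_{n=1}^{T} q_{n-1}^\top \tilde{X}^2(n) + \frac{\mathrm{KL}(q, q_0)}{\eta}. \]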
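To satisfy the condition for the inequality (14), i.e., $\eta |\tilde{X}_i(n)| \le 1$, $\forall i \in [d]$, we find the upper bound
for $\max_{i \in [d]} |\tilde{X}_i(n)|$ as follows:

\begin{align*}
\max_{i \in [d]} |\tilde{X}_i(n)| &\le \|\tilde{X}(n)\|_2 \\
&= \|\Sigma_{n-1}^+ M(n) Y_n\|_2 \\
&\le m \|\Sigma_{n-1}^+ M(n)\|_2 \\
&\le m \sqrt{M(n)^\top \Sigma_{n-1}^+ \Sigma_{n-1}^+ M(n)} \\
&\le m \|M(n)\|_2 \sqrt{\lambda_{\max}\big(\Sigma_{n-1}^+ \Sigma_{n-1}^+\big)} \\
&= m^{3/2} \sqrt{\lambda_{\max}\big(\Sigma_{n-1}^+ \Sigma_{n-1}^+\big)} \\
&= m^{3/2} \lambda_{\max}\big(\Sigma_{n-1}^+\big) \\
&= \frac{m^{3/2}}{\lambda_{\min}(\Sigma_{n-1})},
\end{align*}

where $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ respectively denote the maximum and the minimum nonzero eigenvalue
of matrix $A$. Note that $\mu^0$ induces the uniform distribution over $\mathcal{M}$. Thus by $q'_{n-1} = (1 - \gamma) q_{n-1} + \gamma \mu^0$
we see that $p_{n-1}$ is a mixture of the uniform distribution and the distribution induced
by $q_{n-1}$. Note that we have:

\[ \lambda_{\min}(\Sigma_{n-1}) = \min_{\|x\|_2 = 1, \, x \in \mathrm{span}(\mathcal{M})} x^\top \Sigma_{n-1} x. \]

Moreover, we have

\[ x^\top \Sigma_{n-1} x = \mathbb{E}\big[x^\top M(n) M(n)^\top x\big] = \mathbb{E}\big[(M(n)^\top x)^2\big] \ge \gamma \, \mathbb{E}\big[(M^\top x)^2\big], \]

where in the last inequality $M$ has law $\mu^0$. By definition, we have for any $x \in \mathrm{span}(\mathcal{M})$ with
$\|x\|_2 = 1$,

\[ \mathbb{E}\big[(M^\top x)^2\big] \ge \lambda, \]

so that in the end, we get $\lambda_{\min}(\Sigma_{n-1}) \ge \gamma \lambda$, and hence $\eta |\tilde{X}_i(n)| \le \frac{\eta m^{3/2}}{\gamma \lambda}$, $\forall i \in [d]$. Finally, we
choose $\eta \le \frac{\gamma \lambda}{m^{3/2}}$ to satisfy the condition for the inequality we used in (14). □

We have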


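\[ \mathbb{E}_n\big[\tilde{X}(n)\big] = \mathbb{E}_n\big[Y_n \Sigma_{n-1}^+ M(n)\big] = \mathbb{E}_n\big[\Sigma_{n-1}^+ M(n) M(n)^\top X(n)\big] = \Sigma_{n-1}^+ \Sigma_{n-1} X(n) = \bar{X}(n), \]

where the last equality follows from Lemma 3 and $\bar{X}(n)$ is the orthogonal projection of $X(n)$ onto
$\mathrm{span}(\mathcal{M})$. In particular, for any $mq \in \mathrm{Co}(\mathcal{M})$, we have

\[ \mathbb{E}_n\big[m q^\top \tilde{X}(n)\big] = m q^\top \bar{X}(n) = m q^\top X(n). \]

Moreover, we have:

\begin{align*}
\mathbb{E}_n\big[q_{n-1}^\top \tilde{X}^2(n)\big] &= \sum_{i \in [d]} q_{n-1}(i) \, \mathbb{E}_n\big[\tilde{X}_i^2(n)\big] \\
&= \sum_{i \in [d]} \frac{q'_{n-1}(i) - \gamma \mu^0(i)}{1 - \gamma} \, \mathbb{E}_n\big[\tilde{X}_i^2(n)\big] \\
&\le \frac{1}{m(1 - \gamma)} \sum_{i \in [d]} m q'_{n-1}(i) \, \mathbb{E}_n\big[\tilde{X}_i^2(n)\big] \\
&= \frac{1}{m(1 - \gamma)} \mathbb{E}_n\Big[\sum_{i \in [d]} \tilde{M}_i(n) \tilde{X}_i^2(n)\Big],
\end{align*}
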
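where $\tilde{M}(n)$ is a random arm with the same law as $M(n)$ and independent of $M(n)$. Note that
$\tilde{M}_i^2(n) = \tilde{M}_i(n)$, so that we have

\begin{align*}
\mathbb{E}_n\Big[\sum_{i \in [d]} \tilde{M}_i(n) \tilde{X}_i^2(n)\Big] &= \mathbb{E}_n\Big[X(n)^\top M(n) M(n)^\top \Sigma_{n-1}^+ \tilde{M}(n) \tilde{M}(n)^\top \Sigma_{n-1}^+ M(n) M(n)^\top X(n)\Big] \\
&\le m^2 \, \mathbb{E}_n\big[M(n)^\top \Sigma_{n-1}^+ M(n)\big],
\end{align*}

where we used the bound $M(n)^\top X(n) \le m$. By [4, Lemma 15], $\mathbb{E}_n[M(n)^\top \Sigma_{n-1}^+ M(n)] \le d$, so
that we have:

\[ \mathbb{E}_n\big[q_{n-1}^\top \tilde{X}^2(n)\big] \le \frac{m d}{1 - \gamma}. \]

Observe that

\begin{align*}
\mathbb{E}_n\big[q^{*\top} \tilde{X}(n) - q'^{\top}_{n-1} \tilde{X}(n)\big]
&= \mathbb{E}_n\big[q^{*\top} \tilde{X}(n) - (1 - \gamma) q_{n-1}^\top \tilde{X}(n) - \gamma \mu^{0\top} \tilde{X}(n)\big] \\
&= \mathbb{E}_n\big[q^{*\top} \tilde{X}(n) - q_{n-1}^\top \tilde{X}(n)\big] + \gamma q_{n-1}^\top X(n) - \gamma \mu^{0\top} X(n) \\
&\le \mathbb{E}_n\big[q^{*\top} \tilde{X}(n) - q_{n-1}^\top \tilde{X}(n)\big] + \gamma q_{n-1}^\top X(n) \\
&\le \mathbb{E}_n\big[q^{*\top} \tilde{X}(n) - q_{n-1}^\top \tilde{X}(n)\big] + \gamma.
\end{align*}
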

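Using Lemma 4 and the above bounds, we get, with $mq^*$ the optimal arm, i.e. $q^*(i) = \frac{1}{m}$ iff $M^*_i = 1$,

\begin{align*}
R^{\mathrm{CombEXP}}(T) &= \mathbb{E}\Big[\sum_{n=1}^{T} m q^{*\top} \tilde{X}(n) - \sum_{n=1}^{T} m q'^{\top}_{n-1} \tilde{X}(n)\Big] \\
&\le \mathbb{E}\Big[\sum_{n=1}^{T} m q^{*\top} \tilde{X}(n) - \sum_{n=1}^{T} m q_{n-1}^\top \tilde{X}(n)\Big] + m \gamma T \\
&\le \frac{\eta m^2 d T}{1 - \gamma} + \frac{m \log \mu_{\min}^{-1}}{\eta} + m \gamma T,
\end{align*}

since

\[ \mathrm{KL}(q^*, q_0) = -\frac{1}{m} \sum_{i \in M^*} \log\big(m \mu_i^0\big) \le \log \mu_{\min}^{-1}. \]

Choosing $\eta = \gamma C$ with $C = \frac{\lambda}{m^{3/2}}$ gives

\begin{align*}
R^{\mathrm{CombEXP}}(T) &\le \frac{\gamma C m^2 d T}{1 - \gamma} + \frac{m \log \mu_{\min}^{-1}}{\gamma C} + m \gamma T \\
&= \frac{(C m^2 d + m - m\gamma) \gamma T}{1 - \gamma} + \frac{m \log \mu_{\min}^{-1}}{\gamma C} \\
&\le \frac{(C m^2 d + m) \gamma T}{1 - \gamma} + \frac{m \log \mu_{\min}^{-1}}{\gamma C}.
\end{align*}

The proof is completed by setting $\gamma = \frac{\sqrt{m \log \mu_{\min}^{-1}}}{\sqrt{m \log \mu_{\min}^{-1}} + \sqrt{C (C m^2 d + m) T}}$.

□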
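
The parameter choice at the end of the proof is easy to evaluate numerically. The sketch below computes $\gamma$ and $\eta = \gamma C$ from $(m, d, T, \lambda, \mu_{\min})$ and evaluates the last bound of the proof. Writing $a = \sqrt{m \log \mu_{\min}^{-1}}$ and $b = \sqrt{C(Cm^2 d + m)T}$, substituting $\gamma = a/(a + b)$ turns that bound into $(2ab + a^2)/C$, which the code uses as a cross-check. The function names and the numerical instance are hypothetical.

```python
from math import log, sqrt

def combexp_parameters(m, d, horizon, lam, mu_min):
    """Return (gamma, eta) with eta = gamma * C and C = lam / m^(3/2)."""
    C = lam / m ** 1.5
    a = sqrt(m * log(1.0 / mu_min))
    b = sqrt(C * (C * m * m * d + m) * horizon)
    gamma = a / (a + b)
    return gamma, gamma * C

def regret_bound(m, d, horizon, lam, mu_min):
    """Evaluate (C m^2 d + m) gamma T / (1 - gamma) + m log(1/mu_min) / (gamma C)."""
    C = lam / m ** 1.5
    gamma, _ = combexp_parameters(m, d, horizon, lam, mu_min)
    return ((C * m * m * d + m) * gamma * horizon / (1.0 - gamma)
            + m * log(1.0 / mu_min) / (gamma * C))

# Hypothetical instance: values of lam and mu_min are supplied by the user.
m, d, horizon, lam, mu_min = 2, 10, 10_000, 16 / 90, 0.2
gamma, eta = combexp_parameters(m, d, horizon, lam, mu_min)
print(gamma, eta, regret_bound(m, d, horizon, lam, mu_min))
```

Both parameters shrink as the horizon $T$ grows, since $b$ scales as $\sqrt{T}$ while $a$ does not depend on $T$.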


C.2 Proof of Proposition 1 

We first provide a simple result: 

Lemma 5 The KL-divergence $z \mapsto \mathrm{KL}(z, q)$ is 1-strongly convex with respect to the $\|\cdot\|_1$ norm.

Proof. To prove the lemma, it suffices to show that for any $x, y \in \mathcal{P}$:

\[ \big(\nabla \mathrm{KL}(x, q) - \nabla \mathrm{KL}(y, q)\big)^\top (x - y) \ge \|x - y\|_1^2. \]

We have

\begin{align*}
\big(\nabla \mathrm{KL}(x, q) - \nabla \mathrm{KL}(y, q)\big)^\top (x - y)
&= \sum_{i \in [d]} \Big(1 + \log \frac{x(i)}{q(i)} - 1 - \log \frac{y(i)}{q(i)}\Big) (x(i) - y(i)) \\
&= \sum_{i \in [d]} \big(1 + \log x(i) - 1 - \log y(i)\big) (x(i) - y(i)) \\
&= \Big(\nabla \sum_{i \in [d]} x(i) \log x(i) - \nabla \sum_{i \in [d]} y(i) \log y(i)\Big)^\top (x - y) \\
&\ge \|x - y\|_1^2,
\end{align*}

where the last inequality follows from strong convexity of the entropy function $z \mapsto \sum_i z_i \log z_i$
with respect to the $\|\cdot\|_1$ norm [28, Proposition 5.1]. □


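Recall that $u_n = \arg\min_{p \in \mathcal{P}} \mathrm{KL}(p, \tilde{q}_n)$ and that $q_n$ is an $\epsilon_n$-optimal solution for the projection
step, that is

\[ \mathrm{KL}(u_n, \tilde{q}_n) \ge \mathrm{KL}(q_n, \tilde{q}_n) - \epsilon_n. \]

By Lemma 5, we have

\[ \mathrm{KL}(q_n, \tilde{q}_n) - \mathrm{KL}(u_n, \tilde{q}_n) \ge (q_n - u_n)^\top \nabla \mathrm{KL}(u_n, \tilde{q}_n) + \frac{1}{2} \|q_n - u_n\|_1^2 \ge \frac{1}{2} \|q_n - u_n\|_1^2, \]

where we used $(q_n - u_n)^\top \nabla \mathrm{KL}(u_n, \tilde{q}_n) \ge 0$ due to the first-order optimality condition for $u_n$. Hence
$\mathrm{KL}(q_n, \tilde{q}_n) - \mathrm{KL}(u_n, \tilde{q}_n) \le \epsilon_n$ implies that $\|q_n - u_n\|_\infty \le \|q_n - u_n\|_1 \le \sqrt{2 \epsilon_n}$.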

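Consider $q^*$, the distribution over $\mathcal{P}$ for the optimal arm, i.e. $q^*(i) = \frac{1}{m}$ iff $M^*_i = 1$. Recall that
from the proof of Lemma 4, for $q = q^*$ we have

\[ \mathrm{KL}(q^*, \tilde{q}_n) - \mathrm{KL}(q^*, q_{n-1}) \le \eta q_{n-1}^\top \tilde{X}(n) - \eta q^{*\top} \tilde{X}(n) + \eta^2 q_{n-1}^\top \tilde{X}^2(n). \tag{16} \]

The generalized Pythagorean inequality (see Theorem 3.1 in [23]) gives

\[ \mathrm{KL}(q^*, \tilde{q}_n) \ge \mathrm{KL}(q^*, u_n) + \mathrm{KL}(u_n, \tilde{q}_n). \tag{17} \]

Let $\underline{q}_n = \min_i q_n(i)$. Observe that

\begin{align*}
\mathrm{KL}(q^*, u_n) &= \sum_{i \in [d]} q^*(i) \log \frac{q^*(i)}{u_n(i)} = -\frac{1}{m} \sum_{i \in M^*} \log\big(m u_n(i)\big) \\
&\ge -\frac{1}{m} \sum_{i \in M^*} \log\big(m (q_n(i) + \sqrt{2 \epsilon_n})\big) \ge -\frac{1}{m} \sum_{i \in M^*} \Big(\log\big(m q_n(i)\big) + \frac{\sqrt{2 \epsilon_n}}{q_n(i)}\Big) \\
&\ge -\frac{\sqrt{2 \epsilon_n}}{\underline{q}_n} - \frac{1}{m} \sum_{i \in M^*} \log\big(m q_n(i)\big) = -\frac{\sqrt{2 \epsilon_n}}{\underline{q}_n} + \mathrm{KL}(q^*, q_n).
\end{align*}

Plugging this into (17), we get

\[ \mathrm{KL}(q^*, \tilde{q}_n) \ge \mathrm{KL}(q^*, q_n) - \frac{\sqrt{2 \epsilon_n}}{\underline{q}_n} + \mathrm{KL}(u_n, \tilde{q}_n) \ge \mathrm{KL}(q^*, q_n) - \frac{\sqrt{2 \epsilon_n}}{\underline{q}_n}. \]

Putting this together with (16) yields

\[ \mathrm{KL}(q^*, q_n) - \mathrm{KL}(q^*, q_{n-1}) \le \eta q_{n-1}^\top \tilde{X}(n) - \eta q^{*\top} \tilde{X}(n) + \eta^2 q_{n-1}^\top \tilde{X}^2(n) + \frac{\sqrt{2 \epsilon_n}}{\underline{q}_n}. \]

Finally, summing over $n$ gives

\[ \sum_{n=1}^{T} \big(q^{*\top} \tilde{X}(n) - q_{n-1}^\top \tilde{X}(n)\big) \le \eta \sum_{n=1}^{T} q_{n-1}^\top \tilde{X}^2(n) + \frac{\mathrm{KL}(q^*, q_0)}{\eta} + \frac{1}{\eta} \sum_{n=1}^{T} \frac{\sqrt{2 \epsilon_n}}{\underline{q}_n}. \]
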
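Defining

\[ \epsilon_n = \frac{\big(\underline{q}_n \log \mu_{\min}^{-1}\big)^2}{32 n^2 \log^3(n + 1)}, \quad \forall n \ge 1, \]

and recalling that $\mathrm{KL}(q^*, q_0) \le \log \mu_{\min}^{-1}$, we get

\begin{align*}
\sum_{n=1}^{T} \big(q^{*\top} \tilde{X}(n) - q_{n-1}^\top \tilde{X}(n)\big)
&\le \eta \sum_{n=1}^{T} q_{n-1}^\top \tilde{X}^2(n) + \frac{\log \mu_{\min}^{-1}}{\eta} + \frac{\log \mu_{\min}^{-1}}{\eta} \sum_{n \ge 1} \sqrt{\frac{2}{32 n^2 \log^3(n + 1)}} \\
&\le \eta \sum_{n=1}^{T} q_{n-1}^\top \tilde{X}^2(n) + \frac{2 \log \mu_{\min}^{-1}}{\eta},
\end{align*}

where we used the fact that $\sum_{n \ge 1} n^{-1} (\log(n + 1))^{-3/2} \le 4$. We remark that by the properties of the KL
divergence and since $q'_{n-1} \ge \gamma \mu^0 > 0$, we have $\underline{q}_n > 0$ at every round $n$, so that $\epsilon_n > 0$ at every
round $n$.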

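Using the above result and following the same lines as in the proof of Theorem 6, we have

\[ R^{\mathrm{CombEXP}}(T) \le \frac{\eta m^2 d T}{1 - \gamma} + \frac{2 m \log \mu_{\min}^{-1}}{\eta} + m \gamma T. \]

Choosing $\eta = \gamma C$ with $C = \frac{\lambda}{m^{3/2}}$ gives

\[ R^{\mathrm{CombEXP}}(T) \le \frac{(C m^2 d + m) \gamma T}{1 - \gamma} + \frac{2 m \log \mu_{\min}^{-1}}{\gamma C}. \]

The proof is completed by setting $\gamma = \frac{\sqrt{2 m \log \mu_{\min}^{-1}}}{\sqrt{2 m \log \mu_{\min}^{-1}} + \sqrt{C (C m^2 d + m) T}}$.

□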


C.3 Proof of Theorem 7 

We calculate the time complexity of the various steps of CombEXP at round $n \ge 1$.

(i) Mixing: This step requires $O(d)$ time.

(ii) Decomposition: Using the algorithm of [25], the vector $m q'_{n-1}$ may be represented as a
convex combination of at most $d + 1$ arms in $O(d^4)$ time, so that $p_{n-1}$ may have at most
$d + 1$ non-zero elements (observe that the existence of such a representation follows from
Carathéodory's theorem).

(iii) Sampling: This step takes $O(d)$ time since $p_{n-1}$ has at most $d + 1$ non-zero elements.

(iv) Estimation: The construction of matrix $\Sigma_{n-1}$ is done in time $O(d^2)$ since $p_{n-1}$ has at most
$d + 1$ non-zero elements and $M M^\top$ is formed in $O(d)$ time. Computing the pseudo-inverse
of $\Sigma_{n-1}$ costs $O(d^3)$.

(v) Update: This step requires $O(d)$ time.

(vi) Projection: The projection step is equivalent to solving a convex program up to accuracy
$\epsilon_n = O(n^{-2} \log^{-3}(n))$. We use the Interior-Point Method (Barrier method). The total
number of Newton iterations to achieve accuracy $\epsilon_n$ is $O(\sqrt{s} \log(s/\epsilon_n))$ [24, Ch. 11].
Moreover, the cost of each iteration is $O((d + c)^3)$ [24, Ch. 10], so that the total cost of this
step becomes $O(\sqrt{s} (c + d)^3 \log(s/\epsilon_n))$. Plugging $\epsilon_n = O(n^{-2} \log^{-3}(n))$ and noting that
$O(\sum_{n=1}^{T} \log(s/\epsilon_n)) = O(T \log(T))$, the cost of this step over $T$ rounds is $O(\sqrt{s} (c + d)^3 T \log(T))$.

Hence the total time complexity after $T$ rounds is $O(T[\sqrt{s} (c + d)^3 \log(T) + d^4])$, which completes
the proof. □
C.4 Implementation: The Case of Graph Coloring

In this subsection, we present an iterative algorithm for the projection step of CombEXP, for the 
graph coloring problem described next. 

Consider a graph $G = (V, E)$ consisting of $m$ nodes indexed by $i \in [m]$. Each node can use one
of the $c \ge m$ available colors indexed by $j \in [c]$. A feasible coloring is represented by a matrix
$M \in \{0, 1\}^{m \times c}$, where $M_{ij} = 1$ if and only if node $i$ is assigned color $j$. Coloring $M$ is feasible
if (i) for all $i$, node $i$ uses at most one color, i.e., $\sum_{j \in [c]} M_{ij} \le 1$; (ii) neighboring nodes are
assigned different colors, i.e., for all $i, i' \in [m]$, $(i, i') \in E$ implies for all $j \in [c]$, $M_{ij} M_{i'j} = 0$. In
the following we denote by $\mathcal{K} = \{\mathcal{K}_\ell, \ell \in [k]\}$ the set of maximal cliques of the graph $G$. We also
introduce $K_{\ell i} \in \{0, 1\}$ such that $K_{\ell i} = 1$ if and only if node $i$ belongs to the maximal clique $\mathcal{K}_\ell$.

There is a specific case where our algorithm can be implemented efficiently: when the convex hull
$\mathrm{Co}(\mathcal{M})$ can be captured by polynomially (in $m$) many constraints. Note that this cannot be ensured
unless restrictive assumptions are made on the graph $G$ since there are up to $3^{m/3}$ maximal cliques
in a graph with $m$ vertices [29]. There are families of graphs in which the number of cliques
is polynomially bounded. These families include chordal graphs, complete graphs, triangle-free
graphs, interval graphs, and planar graphs. Note however, that a limited number of cliques does
not ensure a priori that $\mathrm{Co}(\mathcal{M})$ can be captured by a limited number of constraints. To the best of
our knowledge, this problem is open and only particular cases have been solved, as for the stable set
polytope (corresponding to the case $c = 2$, $X_{i1} = 1$ and $X_{i2} = 0$ with our notation) [30].

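For the coloring problem described above we have

\[ \mathrm{Co}(\mathcal{M}) = \mathrm{Co}\Big\{ \forall i, \ \sum_{j \in [c]} M_{ij} \le 1, \quad \forall \ell, j, \ \sum_{i \in [m]} K_{\ell i} M_{ij} \le 1 \Big\}. \tag{18} \]

Note that in the special case where $G$ is the complete graph, such a representation becomes

\[ \mathrm{Co}(\mathcal{M}) = \mathrm{Co}\Big\{ \forall i, \ \sum_{j \in [c]} M_{ij} \le 1, \quad \forall j, \ \sum_{i \in [m]} M_{ij} \le 1 \Big\}. \]

We now give an algorithm for the projection of a distribution $p$ onto $\mathcal{P}$ using the KL divergence. Since $\mathcal{P}$
is a scaled version of $\mathrm{Co}(\mathcal{M})$, we give an algorithm for the projection of $mp$ onto $\mathrm{Co}(\mathcal{M})$ given by
(18).

Set $\lambda_i(0) = \mu_j(0) = 0$ for all $i, j$ and then define for $t \ge 0$,

\begin{align}
\forall i \in [m], \quad \lambda_i(t + 1) &= \log\Big(\sum_{j} m p_{ij} e^{-\mu_j(t)}\Big), \tag{19} \\
\forall j \in [c], \quad \mu_j(t + 1) &= \max_{\ell} \log\Big(\sum_{i} K_{\ell i} \, m p_{ij} e^{-\lambda_i(t + 1)}\Big). \tag{20}
\end{align}
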

We can show that 


Proposition 2 Let $p^*_{ij} = \lim_{t \to \infty} p_{ij} e^{-\lambda_i(t) - \mu_j(t)}$. Then $mp^*$ is the projection of $mp$ onto $\mathrm{Co}(\mathcal{M})$
using the KL divergence.

Although this algorithm is shown to converge, we must stress that the step (20) might be expensive
as the number of distinct values of $\ell$ might be exponential in $m$. When $G$ is a complete graph, this
step is easy and our algorithm reduces to Sinkhorn's algorithm (see [26] for a discussion).

Proof: First note that the definition of projection can be extended to non-negative vectors thanks to
the relation

$$KL(p^\star, q) = \min_{p \in \Xi} KL(p, q).$$

More precisely, given an alphabet $A$ and a vector $q \in \mathbb{R}_+^A$, we have for any probability vector $p \in \mathbb{R}_+^A$

$$\sum_{a \in A} p(a) \log \frac{p(a)}{q(a)} \ge \Big(\sum_{a} p(a)\Big) \log \frac{\sum_a p(a)}{\sum_a q(a)} = \log \frac{1}{\|q\|_1},$$

thanks to the log-sum inequality. Hence we see that $p^\star(a) = \frac{q(a)}{\|q\|_1}$ is the projection of $q$ onto the
simplex of $\mathbb{R}_+^A$.
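
The bound above is easy to check numerically. The following sketch (the helper names `kl` and `project_onto_simplex` and the vectors `q` and `other` are chosen only for illustration) confirms on one example that the normalized vector attains $\log(1/\|q\|_1)$ and that another probability vector does not do better.

```python
import math


def kl(p, q):
    """Generalized KL divergence sum_a p(a) * log(p(a) / q(a))."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)


def project_onto_simplex(q):
    """KL projection of a positive vector onto the simplex: normalization."""
    norm = sum(q)
    return [qa / norm for qa in q]


q = [0.5, 2.0, 1.5]                    # positive, not normalized
p_star = project_onto_simplex(q)
lower_bound = math.log(1.0 / sum(q))   # log(1 / ||q||_1)
other = [0.2, 0.3, 0.5]                # some other probability vector

print(kl(p_star, q), lower_bound, kl(other, q))
```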


Now define $A_i = Co\{M_{ij},\ \sum_j M_{ij} \le 1\}$ and $B_{\ell j} = Co\{M_{ij},\ \sum_i K_{\ell i} M_{ij} \le 1\}$. Hence
$\bigcap_i A_i \cap \bigcap_{\ell, j} B_{\ell j} = Co(\mathcal{M})$. By the argument described above, iteration (19) (resp. (20)) corresponds
to the projection onto $A_i$ (resp. $\bigcap_\ell B_{\ell j}$) and the proposition follows from Theorem 5.1 in

□ 


C.5 Examples 

In this subsection, we compare the performance of CombEXP against state-of-the-art algorithms 
(refer to Table 2 for the summary of regret of various algorithms). 


C.5.1 m-sets 


In this case, $\mathcal{M}$ is the set of all $d$-dimensional binary vectors with $m$ ones. We have

$$\mu_{\min} = \min_i \frac{1}{\binom{d}{m}} \sum_{M \in \mathcal{M}} M_i = \frac{m}{d}.$$

Moreover, according to [4, Proposition 12], we have $\lambda = \frac{m(d-m)}{d(d-1)}$. When $m = o(d)$, the regret of
CombEXP becomes $O(\sqrt{m^3 d T \log(d/m)})$, namely it has the same performance as ComBand
and EXP2 WITH JOHN'S EXPLORATION.
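
Both quantities can be checked by brute-force enumeration for small $d$ and $m$. The sketch below is an illustration under the assumption that $\lambda$ denotes the smallest nonzero eigenvalue of $E[MM^\top]$ for $M$ uniform over $\mathcal{M}$; the helper name `second_moment_m_sets` is hypothetical. For $m$-sets this matrix has the form $aI + bJ$ (with $J$ the all-ones matrix), so its eigenvalues are $a$ with multiplicity $d-1$ and $a + db$.

```python
from fractions import Fraction
from itertools import combinations


def second_moment_m_sets(d, m):
    """Exact E[M M^T] for M uniform over d-dim binary vectors with m ones."""
    arms = list(combinations(range(d), m))
    weight = Fraction(1, len(arms))
    S = [[Fraction(0)] * d for _ in range(d)]
    for arm in arms:
        for i in arm:
            for j in arm:
                S[i][j] += weight
    return S


d, m = 6, 2
S = second_moment_m_sets(d, m)
mu_min = min(S[i][i] for i in range(d))   # diagonal entries are P(M_i = 1)
b = S[0][1]                               # common off-diagonal entry
a = S[0][0] - b                           # eigenvalue with multiplicity d - 1
top = a + d * b                           # eigenvalue of the all-ones vector

print(mu_min, a, top)
```

Here `a` matches $m(d-m)/(d(d-1))$ and is the smaller of the two eigenvalues, while `top` equals $m^2/d$.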


C.5.2 Matching 

Let $\mathcal{M}$ be the set of perfect matchings in $\mathcal{K}_{m,m}$, where we have $d = m^2$ and $|\mathcal{M}| = m!$. We have

$$\mu_{\min} = \min_i \frac{1}{m!} \sum_{M \in \mathcal{M}} M_i = \frac{(m-1)!}{m!} = \frac{1}{m}.$$

Furthermore, from [4, Proposition 4] we have that $\lambda = \frac{1}{m-1}$, thus giving $R^{\text{CombEXP}}(T) =
O(\sqrt{m^5 T \log m})$, which is the same as the regret of ComBand and EXP2 WITH JOHN'S
EXPLORATION in this case.
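
For small $m$ these values can also be checked by enumerating all $m!$ matchings. The sketch below is illustrative only: `second_moment_matchings` is a hypothetical helper, matchings are encoded as flattened permutation matrices, and the check only shows that $1/(m-1)$ is an eigenvalue of $E[MM^\top]$ (using a test vector $v_{ij} = a_i b_j$ with $\sum_i a_i = \sum_j b_j = 0$), not that it is the smallest nonzero one.

```python
from fractions import Fraction
from itertools import permutations


def second_moment_matchings(m):
    """Exact E[M M^T] for M uniform over m x m permutation matrices.

    Entry (i, j) of a matching is stored at flat index i * m + j.
    """
    perms = list(permutations(range(m)))
    weight = Fraction(1, len(perms))
    d = m * m
    S = [[Fraction(0)] * d for _ in range(d)]
    for perm in perms:
        ones = [i * m + perm[i] for i in range(m)]
        for u in ones:
            for w in ones:
                S[u][w] += weight
    return S


m = 4
S = second_moment_matchings(m)
mu_min = min(S[u][u] for u in range(m * m))   # P(M_ij = 1), same for all (i, j)

# Test vector v_ij = a_i * b_j with a and b summing to zero.
a = [1, -1] + [0] * (m - 2)
b = [1, -1] + [0] * (m - 2)
v = [Fraction(a[i] * b[j]) for i in range(m) for j in range(m)]
Sv = [sum(S[u][w] * v[w] for w in range(m * m)) for u in range(m * m)]

print(mu_min, Sv[0] / v[0])
```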


C.5.3 Spanning Trees 


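In our next example, we assume that $\mathcal{M}$ is the set of spanning trees in the complete graph $\mathcal{K}_N$. In
this case, we have $d = \binom{N}{2}$, $m = N - 1$, and by Cayley's formula $\mathcal{M}$ has $N^{N-2}$ elements. Observe
that

$$\mu_{\min} = \min_i \frac{1}{N^{N-2}} \sum_{M \in \mathcal{M}} M_i = \frac{(N-1)^{N-3}}{N^{N-2}},$$

which gives for $N \ge 2$

$$\log \mu_{\min}^{-1} = \log\Big(\frac{N^{N-2}}{(N-1)^{N-3}}\Big) = (N-3) \log\Big(\frac{N}{N-1}\Big) + \log N \le (N-3) \log 2 + \log N \le 2N.$$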


From [4, Corollary 7], we also get $\lambda \ge \frac{1}{N} - \frac{17}{4N^2}$. For $N \ge 6$, the regret of ComBand takes the
form $O(\sqrt{N^5 T \log N})$ since $\frac{m}{d\lambda} < 7$ when $N \ge 6$. Further, EXP2 WITH JOHN'S EXPLORATION
attains the same regret. On the other hand, we get

$$R^{\text{CombEXP}}(T) = O(\sqrt{N^5 T \log N}), \quad N \ge 6,$$

and therefore it gives the same regret as ComBand and EXP2 WITH JOHN'S EXPLORATION.
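
The constant 7 can be traced with a short calculation. Assuming the lower bound $\lambda \ge \frac{1}{N} - \frac{17}{4N^2} = \frac{4N - 17}{4N^2}$ and using $m = N - 1$ and $d = \binom{N}{2}$, so that $m/d = 2/N$:

$$\frac{m}{d\lambda} \le \frac{2}{N} \cdot \frac{4N^2}{4N - 17} = \frac{8N}{4N - 17} = 2 + \frac{34}{4N - 17}.$$

The right-hand side is decreasing in $N$ for $N \ge 5$ and equals $48/7 \approx 6.86$ at $N = 6$, hence it stays below 7 for all $N \ge 6$.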
C.5.4 Cut sets


Consider the case where $\mathcal{M}$ is the set of balanced cuts of the complete graph $\mathcal{K}_{2N}$, where a balanced
cut is defined as the set of edges between a set of $N$ vertices and its complement. It is easy to verify
that $d = \binom{2N}{2}$ and $m = N^2$. Moreover, $\mathcal{M}$ has $\binom{2N}{N}$ balanced cuts and hence

$$\mu_{\min} = \min_i \frac{1}{\binom{2N}{N}} \sum_{M \in \mathcal{M}} M_i = \frac{\binom{2N-2}{N-1}}{\binom{2N}{N}} = \frac{N}{4N - 2}.$$

Moreover, by [4, Proposition 9], we have

$$\lambda = \frac{1}{4} + \frac{8N - 7}{4(2N-1)(2N-3)}, \quad N \ge 2,$$

and consequently, the regret of CombEXP becomes $O(N^4 \sqrt{T})$ for $N \ge 2$, which is the same as
that of ComBand and EXP2 WITH JOHN'S EXPLORATION.
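
The binomial simplification and the size of the two constants can be checked with exact arithmetic. The sketch below is an illustration only: the function names are hypothetical and the formulas are simply those stated in this example. It shows that $\mu_{\min}$ and $\lambda$ both stay above $1/4$, so $\log \mu_{\min}^{-1} \le \log 4$ does not grow with $N$, which is consistent with the absence of a logarithmic factor in the regret above.

```python
from fractions import Fraction
from math import comb


def mu_min_cut_sets(N):
    """Ratio binom(2N-2, N-1) / binom(2N, N), as an exact fraction."""
    return Fraction(comb(2 * N - 2, N - 1), comb(2 * N, N))


def lambda_cut_sets(N):
    """Value 1/4 + (8N - 7) / (4 (2N - 1)(2N - 3)), for N >= 2."""
    return Fraction(1, 4) + Fraction(8 * N - 7, 4 * (2 * N - 1) * (2 * N - 3))


for N in (2, 3, 5, 10):
    print(N, mu_min_cut_sets(N), lambda_cut_sets(N))
```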